Data Mining Models and Evaluation Techniques


Data Mining Models and Evaluation Techniques

Shubham Pachori 12BCE55

DEPARTMENT OF COMPUTER ENGINEERING AHMEDABAD - 382424 November 2014

Data Mining Models and Evaluation Techniques

A seminar submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering

Shubham Pachori 12BCE55

DEPARTMENT OF COMPUTER ENGINEERING AHMEDABAD - 382424 November 2014

CERTIFICATE

This is to certify that the seminar entitled Data Mining Models and Evaluation Techniques submitted by Shubham Pachori (12BCE55), towards the partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering of Nirma University, Ahmedabad, is the record of work carried out by him under my supervision and guidance. In my opinion, the submitted work has reached a level required for being accepted for examination. The results embodied in this seminar, to the best of my knowledge, haven't been submitted to any other university or institution for award of any degree or diploma.

Prof. K. P. Agrawal, Associate Professor, CSE Department, Institute of Technology, Nirma University, Ahmedabad.

Prof. Anuja Nair, Assistant Professor, CSE Department, Institute of Technology, Nirma University, Ahmedabad.

Dr. Sanjay Garg, Professor & Head of Department, CSE Department, Institute of Technology, Nirma University, Ahmedabad.

CSE Department, Institute of Technology, Nirma University

Acknowledgements

I am profoundly grateful to Prof. K P AGARWAL for his expert guidance throughout the project. His continuous encouragement has fetched us the golden results. His elixir of knowledge in the field has made this project achieve its zenith and credibility. I would like to express my deepest appreciation towards Prof. SANJAY GARG, Head of the Department of Computer Engineering, and Prof. ANUJA NAIR, whose invaluable guidance supported us in completing this project. At last I must express my sincere heartfelt gratitude to all the staff members of the Computer Engineering Department who helped me directly or indirectly during this course of work.

SHUBHAM PACHORI 12BCE55

Abstract

Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Classification models predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. As predictions always have an implicit cost involved, it is important to evaluate a classifier's generalization performance in order to determine whether to employ the classifier (for example, when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers) and to optimize the classifier (for example, when post-pruning decision trees we must evaluate the accuracy of the decision trees on each pruning step). This seminar report gives an in-depth explanation of classifier models (viz. Naive Bayesian and Decision Trees) and how these classifier models are evaluated for their accuracy on predictions. The later part of the report also deals with how to improve the accuracy of these classifier models, and it includes an exploratory study comparing the various model evaluation techniques, carried out in Weka (a GUI-based data mining tool) on representative data sets.

Contents

Certificate
Acknowledgements
Abstract
1 Introduction
2 Classification Using Decision Tree
2.1 Understanding Decision Trees
2.2 Divide and Conquer
2.3 C5.0 Decision Tree Algorithm
2.4 How To Choose The Best Split?
2.5 Pruning The Decision Tree
3 Probabilistic Learning - Naive Bayesian Classification
3.1 Understanding Naive Bayesian Classification
3.2 Bayes' Theorem
3.3 The Naive Bayes Algorithm
3.4 Naive Bayesian Classification
4 Model Evaluation Techniques
4.1 Prediction Accuracy
4.2 Confusion Matrix and Model Evaluation Metrics
4.3 How To Estimate These Metrics?
4.3.1 Training and Independent Test Data
4.3.2 Holdout Method
4.3.3 K-Cross-Validation
4.3.4 Bootstrap
4.3.5 Comparing Two Classifier Models
4.4 ROC Curves
4.5 Ensemble Methods
4.5.1 Why Ensemble Works?

4.5.2 Ensemble Works in Two Ways
4.5.3 Learn To Combine
4.5.4 Learn By Consensus
4.5.5 Bagging
4.5.6 Boosting
5 Conclusion and Future Scope
5.1 Comparative Study
5.2 Conclusion
5.3 Future Scope
References

Chapter 1 Introduction

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the high-level application of particular data mining methods. It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization. The unifying goal of the KDD process is to extract knowledge from data in the context of large databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to the specifications of measures and thresholds, using a database along with any required preprocessing, subsampling, and transformations of that database. The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:

Figure 1.1: KDD Process

1. Developing an understanding of:
   1. the application domain

   2. the relevant prior knowledge
   3. the goals of the end-user

2. Creating a target data set: selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.

3. Data cleaning and preprocessing:
   1. Removal of noise or outliers.
   2. Strategies for handling missing data fields.
   3. Accounting for time sequence information and known changes.

4. Data reduction and projection:
   1. Finding useful features to represent the data depending on the goal of the task.
   2. Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.

5. Choosing the data mining task: deciding whether the goal of the KDD process is classification, regression, clustering, etc.

6. Choosing the data mining algorithm:
   1. Selecting method(s) to be used for searching for patterns in the data.
   2. Deciding which models and parameters may be appropriate.
   3. Matching a particular data mining method with the overall criteria of the KDD process.

7. Data mining: searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.

8. Interpreting mined patterns.

9. Consolidating discovered knowledge.

In the following chapters we will be exploring Data Mining Models and Evaluation Techniques in depth.

Chapter 2 Classification Using Decision Tree

This chapter introduces the most widely used learning methods that apply a similar strategy of dividing data into smaller and smaller portions to identify patterns that can be used for prediction. The knowledge is then presented in the form of logical structures that can be understood without any statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement.

1. Understanding Decision Trees
2. Divide and Conquer
3. C5.0 Decision Tree Algorithm
4. Choosing The Best Split
5. Pruning The Decision Trees

2.1 Understanding Decision Trees

As we might intuit from the name itself, decision tree learners build a model in the form of a tree structure. The model itself comprises a series of logical decisions, similar to a flowchart, with decision nodes that indicate a decision to be made on an attribute. These split into branches that indicate the decision's choices. The tree is terminated by leaf nodes (also known as terminal nodes) that denote the result of following a combination of decisions. Data that is to be classified begins at the root node, where it is passed through the various decisions in the tree according to the values of its features. The path that the data takes funnels each record into a leaf node, which assigns it a predicted

class. As the decision tree is essentially a flowchart, it is particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons or the results need to be shared in order to facilitate decision making. Some potential uses include:

1. Credit scoring models in which the criteria that cause an applicant to be rejected need to be well-specified
2. Marketing studies of customer churn or customer satisfaction that will be shared with management or advertising agencies
3. Diagnosis of medical conditions based on laboratory measurements, symptoms, or rate of disease progression

In spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels, or if the data has a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree.
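The funneling of a record from the root node down through decision nodes to a leaf can be sketched in a few lines of Python. The tree below is a hypothetical credit-scoring rule invented for illustration, not a model from this report:

```python
# Internal nodes are tuples: (attribute, threshold, left_subtree, right_subtree).
# Leaf nodes are plain class labels. Attribute names and thresholds are invented.
tree = ("income", 30000,
        ("has_defaulted", 0.5, "approve", "reject"),  # low income: check credit history
        "approve")                                    # high income: approve outright

def classify(node, record):
    """Funnel one record from the root node down to a leaf node."""
    if not isinstance(node, tuple):      # a leaf node has been reached
        return node
    attribute, threshold, left, right = node
    branch = left if record[attribute] <= threshold else right
    return classify(branch, record)

print(classify(tree, {"income": 20000, "has_defaulted": 1}))  # prints reject
```

Each call inspects one attribute and follows one branch, exactly the flowchart traversal the section describes.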

2.2 Divide and Conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is generally known as divide and conquer because it uses the feature values to split the data into smaller and smaller subsets of similar classes. Beginning at the root node, which represents the entire dataset, the algorithm chooses the feature that is the most predictive of the target class. The examples are then partitioned into groups of distinct values of this feature; this decision forms the first set of tree branches. The algorithm continues to divide-and-conquer the nodes, choosing the best candidate feature each time until a stopping criterion is reached. This might occur at a node if:

1. All (or nearly all) of the examples at the node have the same class
2. There are no remaining features to distinguish among examples
3. The tree has grown to a predefined size limit

To illustrate the tree building process, let's consider a simple example. Imagine that we are working for a Hollywood film studio, and our desk is piled high with screenplays. Rather than read each one cover-to-cover, we decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: mainstream hit, critic's choice, or box office bust. To gather data for the model, we turn to the studio archives to examine the previous ten years of movie releases. After reviewing the data for 30 different movie scripts, a pattern emerges. There seems to be a relationship between the film's proposed shooting budget, the number of A-list celebrities lined up for starring roles, and the categories of success. A scatter plot of this data might look something like figure 2.1 (Reference [2]):

Figure 2.1: Scatter Plot of Budget vs A-List Celebrities (Ref [2])

To build a simple decision tree using this data, we can apply a divide-and-conquer strategy. Let's first split on the feature indicating the number of celebrities, partitioning the movies into groups with and without a low number of A-list stars (fig 2.2, Reference [2]).

Figure 2.2: Split 1: Scatter Plot of Budget vs A-List Celebrities (Ref [2])

Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget (fig 2.3). At this point we have partitioned the data into three groups. The group at the top-left corner of the diagram is composed entirely of critically-acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, the majority of movies are box office hits, with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops.

Figure 2.3: Split 2: Scatter Plot of Budget vs A-List Celebrities (Ref [2])

If we wanted, we could continue to divide the data by splitting it based on increasingly specific ranges of budget and celebrity counts until each of the incorrectly classified values resides in its own, perhaps tiny, partition. Since the data can continue to be split until there are no distinguishing features within a partition, a decision tree can be prone to overfitting the training data with overly-specific decisions. We'll avoid this by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. Our model for predicting the future success of movies can be represented in a simple tree as shown in fig 2.4 (Ref [2]). To evaluate a script, follow the branches through each decision until its success or failure has been predicted. In no time, you will be able to classify the backlog of scripts and get back to more important work, such as writing an awards acceptance speech. Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section we will throw some light on a popular algorithm for building decision tree models automatically.
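As a sketch of the divide-and-conquer idea, the two splits above can be reproduced on a small invented dataset. The records and thresholds below are illustrative only, not the studio data behind figures 2.1-2.3:

```python
# Hypothetical screenplay records: (celebrity_count, budget_in_millions, outcome).
movies = [
    (9, 20, "critical success"),
    (8, 15, "critical success"),
    (9, 90, "mainstream hit"),
    (7, 80, "mainstream hit"),
    (2, 10, "box office bust"),
    (1, 70, "box office bust"),
]

def split(records, feature_index, threshold):
    """One divide-and-conquer step: partition records on a single feature value."""
    left = [r for r in records if r[feature_index] <= threshold]
    right = [r for r in records if r[feature_index] > threshold]
    return left, right

# Split 1: low vs high number of A-list celebrities (threshold is illustrative).
few_stars, many_stars = split(movies, 0, 5)
# Split 2: among star-heavy movies, low vs high budget.
low_budget, high_budget = split(many_stars, 1, 50)

print([m[2] for m in few_stars])    # the flops
print([m[2] for m in low_budget])   # the critics' choices
print([m[2] for m in high_budget])  # the mainstream hits
```

Two threshold splits are enough to leave every partition with a single class, which is exactly where the text stops the algorithm.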

Figure 2.4: Decision Tree Model (Reference [2])

2.3 C5.0 Decision Tree Algorithm

There are numerous implementations of decision trees, but one of the most well known is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his ID3 (Iterative Dichotomiser 3) algorithm.

Strengths of the C5.0 Algorithm

1. An all-purpose classifier that does well on most problems
2. A highly automatic learning process that can handle numeric or nominal features, as well as missing data
3. Uses only the most important features
4. Can be used on data with relatively few training examples or a very large number
5. Results in a model that can be interpreted without a mathematical background (for relatively small trees)
6. More efficient than other complex models

Weaknesses of the C5.0 Algorithm

1. Decision tree models are often biased toward splits on features having a large number of levels
2. It is easy to overfit or underfit the model
3. Can have trouble modeling some relationships due to reliance on axis-parallel splits
4. Small changes in training data can result in large changes to decision logic
5. Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

2.4 How To Choose The Best Split?

The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for feature values that split the data in such a way that partitions contained examples primarily of a single class. If the segments of data contain only a single class, they are considered pure. There are many different measurements of purity for identifying splitting criteria; C5.0 uses entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of 0 indicates that the sample is completely homogeneous, while 1 indicates the maximum amount of disorder. The definition of entropy is specified by:

Entropy(S) = - Σ (i=1 to c) p_i log2(p_i)   (2.1)

In the entropy formula, for a given segment of data (S), the term c refers to the number of different class levels, and p_i refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as:

Entropy(S) = -0.6 log2(0.6) - 0.4 log2(0.4) = 0.9709506   (2.2)

Given this measure of purity, the algorithm must still decide which feature to split upon. For this, the algorithm uses entropy to calculate the change in homogeneity resulting from a split on each possible feature. The calculation is known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1), and the partitions resulting

from the split (S2):

InfoGain(F) = Entropy(S1) - Entropy(S2)   (2.3)

The one complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighing each partition's entropy by the proportion of records falling into that partition. This can be stated in a formula as:

Entropy(S2) = Σ (i=1 to n) w_i Entropy(P_i)   (2.4)

In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions P_i, weighted by the proportion of examples falling in that partition, w_i. The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply the entropy after the split is zero, which means that the split results in completely homogeneous groups. The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. A common practice is testing various splits that divide the values into groups greater than or less than a threshold. This reduces the numeric feature into a two-level categorical feature, and information gain can be calculated easily. The numeric threshold yielding the largest information gain is chosen for the split.

2.5 Pruning The Decision Tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will have been overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data.

One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions, or if the decision nodes contain only a small number of examples. This is called early stopping, or pre-pruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside is that there is no way to know whether the tree will miss subtle, but important, patterns that it would have learned had it grown to a larger size. An alternative, called post-pruning, involves growing a tree that is too large, then using pruning criteria based on the error rates at the nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all important data structures were discovered. One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it takes care of many of the decisions automatically, using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, nodes and branches that have little effect on the classification errors are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively. Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if it improves performance on the test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options.
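The early-stopping rules described above can be sketched as a single predicate that the tree-growing loop consults before each split. The thresholds here are illustrative defaults for the sketch, not C5.0's actual criteria:

```python
def should_stop(labels, depth, max_depth=5, min_examples=10, purity_threshold=0.8):
    """Pre-pruning: stop growing a branch when any early-stopping rule fires.
    labels is the list of class labels reaching the current node."""
    if depth >= max_depth:                # tree reached its predefined size limit
        return True
    if len(labels) < min_examples:        # too few examples left to split reliably
        return True
    majority = max(labels.count(c) for c in set(labels))
    if majority / len(labels) >= purity_threshold:   # node is (nearly) pure
        return True
    return False

print(should_stop(["hit"] * 9 + ["bust"], depth=2))   # 90% one class: prints True
print(should_stop(["hit"] * 5 + ["bust"] * 5, depth=2, min_examples=4))  # prints False
```

Post-pruning, by contrast, would grow the tree past these limits and only afterwards collapse branches whose removal barely changes the error rate.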

Chapter 3 Probabilistic Learning - Naive Bayesian Classification

When a meteorologist provides a weather forecast, precipitation is typically predicted using terms such as a 70 percent chance of rain. These forecasts are known as probability of precipitation reports. Have you ever considered how they are calculated? It is a puzzling question, because in reality, it will either rain or it will not. This chapter covers a machine learning algorithm called naive Bayes, which also uses principles of probability for classification. Just as meteorologists forecast weather, naive Bayes uses data about prior events to estimate the probability of future events. For instance, a common application of naive Bayes uses the frequency of words in past junk email messages to identify new junk mail.

3.1 Understanding Naive Bayesian Classification

The basic statistical ideas necessary to understand the naive Bayes algorithm have been around for centuries. The technique descended from the work of the 18th century mathematician Thomas Bayes, who developed foundational mathematical principles (now known as Bayesian methods) for describing the probability of events, and how probabilities should be revised in light of additional information. Classifiers based on Bayesian methods utilize training data to calculate an observed probability of each class based on feature values. When the classifier is later used on unlabeled data, it uses the observed probabilities to predict the most likely class for the new features. It's a simple idea, but it results in a method that often has results on par with more sophisticated algorithms. In fact, Bayesian classifiers have been used for:

1. Text classification, such as junk email (spam) filtering, author identification, or topic categorization

2. Intrusion detection or anomaly detection in computer networks
3. Diagnosing medical conditions, when given a set of observed symptoms

Typically, Bayesian classifiers are best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the probability of an outcome. While many algorithms ignore features that have weak effects, Bayesian methods utilize all available evidence to subtly change the predictions. If a large number of features have relatively minor effects, taken together their combined impact could be quite large.

3.2 Bayes' Theorem

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered evidence. As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the evidence or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of Rs 40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
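The relationship between the prior and the posterior can be checked with a quick calculation. All three probabilities below are invented for illustration of the buys-computer example; they are not estimates from any dataset:

```python
# Hypothetical estimates for the buys-computer example (assumed values):
p_h = 0.3          # P(H): prior probability that any customer buys a computer
p_x_given_h = 0.2  # P(X|H): probability of the age-35 / Rs 40,000 profile among buyers
p_x = 0.1          # P(X): probability of that profile among all customers

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))   # prints 0.6
```

With these assumed numbers, observing the customer's age and income raises the probability of a purchase from the prior 0.3 to a posterior of 0.6.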
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns Rs 40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns Rs 40,000. How are these probabilities estimated? P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes' theorem is useful in that it provides a way of calculating the

posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes' theorem is:

P(H|X) = P(X|H) P(H) / P(X)   (3.1)

3.3 The Naive Bayes Algorithm

The naive Bayes (NB) algorithm describes a simple application using Bayes' theorem for classification. Although it is not the only machine learning method utilizing Bayesian methods, it is the most common, particularly for text classification, where it has become the de facto standard. Strengths and weaknesses of this algorithm are as follows.

Strengths

1. Simple, fast, and very effective
2. Does well with noisy and missing data
3. Requires relatively few examples for training, but also works well with very large numbers of examples
4. Easy to obtain the estimated probability for a prediction

Weaknesses

1. Relies on an often-faulty assumption of equally important and independent features
2. Not ideal for datasets with large numbers of numeric features
3. Estimated probabilities are less reliable than the predicted classes

The naive Bayes algorithm is named as such because it makes a couple of naive assumptions about the data. In particular, naive Bayes assumes that all of the features in the dataset are equally important and independent. These assumptions are rarely true in most real-world applications. For example, if you were attempting to identify spam by monitoring email messages, it is almost certainly true that some features will be more important than others. For example, the sender of the email may be a more important indicator of spam than the message text. Additionally, the words that appear in the message

body are not independent from one another, since the appearance of some words is a very good indication that other words are also likely to appear: a message with the word "Viagra" is probably likely to also contain the words "prescription" or "drugs". However, in most cases when these assumptions are violated, naive Bayes still performs fairly well. This is true even in extreme circumstances where strong dependencies are found among the features. Due to the algorithm's versatility and accuracy across many types of conditions, naive Bayes is often a strong first candidate for classification learning tasks.

3.4 Naive Bayesian Classification

The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x_1, x_2, ..., x_n), depicting n measurements made on the tuple from n attributes, respectively A_1, A_2, ..., A_n.
2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class C_i if and only if

P(C_i|X) > P(C_j|X)  for 1 ≤ j ≤ m, j ≠ i    (3.2)

Thus we maximize P(C_i|X). The class C_i for which P(C_i|X) is maximized is called the maximum a posteriori hypothesis. By Bayes' theorem,

P(C_i|X) = P(X|C_i) P(C_i) / P(X)    (3.3)

3. As P(X) is constant for all classes, only P(X|C_i) P(C_i) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = ... = P(C_m), and we would therefore maximize P(X|C_i). Otherwise, we maximize P(X|C_i) P(C_i). Note that the class prior probabilities may be estimated by P(C_i) = |C_i,D| / |D|, where |C_i,D| is the number of training tuples of class C_i in D.
4.
Given data sets with many attributes, it would be extremely computationally

expensive to compute P(X|C_i). In order to reduce computation in evaluating P(X|C_i), the naive assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

P(X|C_i) = ∏_{k=1}^{n} P(x_k|C_i)    (3.4)

5. In order to predict the class label of X, P(X|C_i) P(C_i) is evaluated for each class C_i. The classifier predicts that the class label of tuple X is the class C_i if and only if

P(X|C_i) P(C_i) > P(X|C_j) P(C_j)  for 1 ≤ j ≤ m, j ≠ i    (3.5)

In other words, the predicted class label is the class C_i for which P(X|C_i) P(C_i) is the maximum.
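The classification rule in steps 1 to 5 (estimate priors from class counts, multiply the per-attribute conditional probabilities, and pick the arg-max class) can be sketched as follows. The toy weather tuples and attribute names are invented for illustration, and no Laplace smoothing is applied, so unseen attribute values zero out a class.

```python
from collections import Counter, defaultdict

def train_nb(tuples, labels):
    """Estimate P(C_i) and P(x_k | C_i) from simple counts (eqs. 3.2-3.4)."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)            # (class, attribute index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            cond[(c, k)][v] += 1
    return priors, cond

def predict_nb(x, priors, cond):
    """Pick the class maximizing P(C_i) * prod_k P(x_k | C_i)."""
    def score(c):
        p = priors[c]
        for k, v in enumerate(x):
            counts = cond[(c, k)]
            p *= counts[v] / sum(counts.values())  # 0 if value unseen for class
        return p
    return max(priors, key=score)

# Toy example: attributes (outlook, windy), class label = "play?"
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("rain", "no")]
y = ["yes", "yes", "no", "yes"]
priors, cond = train_nb(X, y)
print(predict_nb(("sunny", "no"), priors, cond))  # prints: yes
```

A real implementation would add Laplace smoothing to the conditional estimates so that a single unseen value cannot force a class probability to zero.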

Chapter 4
Model Evaluation Techniques

Having explored the two most widely used classifier models in depth, the question we now face is how accurately these classifiers can predict future trends based on the data used to build them; for example, how accurately the customer recommender system of a company can predict the future purchasing behavior of a customer based on the previously recorded sales data of its customers. Given the significant role these classifiers play, their accuracy becomes of prime importance to companies, especially those in e-commerce. Model evaluation techniques are therefore employed to evaluate the accuracy of the predictions made by a classifier model. As different classifier models have varying strengths and weaknesses, it is necessary to use tests that reveal distinctions among the learners when measuring how a model will perform on future data. The succeeding sections of this chapter will primarily focus on the following points:
1. Why predictive accuracy alone is not sufficient to measure performance, and what the alternative measures are
2. Methods to ensure that the performance measures reasonably reflect a model's ability to predict or forecast unseen data

4.1 Prediction Accuracy

The prediction accuracy of a classifier model is defined as the proportion of correct predictions out of the total number of predictions. This number indicates the percentage of cases in which the learner is right or wrong. For instance, suppose a classifier correctly identified whether or not 99,990 out of 100,000 newborn babies are carriers of a treatable but potentially fatal genetic defect. This would imply an accuracy of 99.99 percent and an error rate of only 0.01 percent.

Although this would appear to indicate an extremely accurate classifier, it would be wise to collect additional information before trusting your child's life to the test. What if the genetic defect is found in only 10 out of every 100,000 babies? A test that predicts no defect regardless of the circumstances will still be correct for 99.99 percent of all cases. In this case, even though the predictions are correct for the large majority of the data, the classifier is not very useful for its intended purpose, which is to identify children with birth defects. The best measure of classifier performance is whether the classifier is successful at its intended purpose. For this reason, it is crucial to have measures of model performance that measure utility rather than raw accuracy.

4.2 Confusion Matrix and Model Evaluation Metrics

A confusion matrix is a matrix that categorizes predictions according to whether they match the actual value in the data. One of the table's dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values. It can be an n-by-n matrix, depending on the number of values the predicted class can take. Figure 4.1 (Reference [2]) depicts 2x2 and 3x3 confusion matrices.
Figure 4.1: Confusion Matrix (Ref [2])
There are four important terms that are considered the building blocks used in computing many evaluation measures. The class of interest is known as the positive class, while all others are known as negative.
1. True Positives (TP): correctly classified as the class of interest.
2. True Negatives (TN): correctly classified as not the class of interest.
3. False Positives (FP): incorrectly classified as the class of interest.
4. False Negatives (FN): incorrectly classified as not the class of interest.

The confusion matrix is a useful tool for analysing how well a classifier can recognize tuples of different classes. TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the classifier is getting things wrong. Given m classes, a confusion matrix is a matrix of at least m-by-m size. An entry, CM_ij, in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM_1,1 to entry CM_m,m, with the rest of the entries being zero or close to zero; that is, ideally, FP and FN are around zero.

Accuracy: The accuracy of a classifier on a given test set is the percentage of test tuples that are correctly classified by the classifier:

accuracy = (TP + TN) / (P + N)    (4.1)

Error Rate: The error rate or misclassification rate of a classifier, M, is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M:

error rate = (FP + FN) / (P + N)    (4.2)

If we use the training set instead of a test set to estimate the error rate of a model, this quantity is known as the resubstitution error. This error estimate is optimistic of the true error rate because the model is not tested on any samples that it has not already seen.

The Class Imbalance Problem: This arises in datasets where the main class of interest is rare; that is, the data set distribution reflects a significant majority of the negative class and a minority positive class. For example, in fraud detection applications, the class of interest (the fraudulent class) occurs far less frequently than the negative (non-fraudulent) class. In medical data there may be a rare class, such as cancer. Suppose that we have trained a classifier to classify medical data tuples, where the class label attribute is cancer and the possible class values are yes and no. An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only, say, 3% of the training tuples are actually cancer?
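As a minimal sketch, equations (4.1) and (4.2) can be computed directly from the four confusion-matrix counts; the counts below are invented to mirror the 97%-accuracy cancer example:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (4.1): fraction of test tuples that are correctly classified."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Eq. (4.2): equivalently, 1 - accuracy."""
    return (fp + fn) / (tp + tn + fp + fn)

# Imbalanced case: the classifier labels everything "no cancer", so it is
# 97% accurate while missing every one of the 30 actual cancer tuples.
tp, tn, fp, fn = 0, 970, 0, 30
print(accuracy(tp, tn, fp, fn))    # 0.97
print(error_rate(tp, tn, fp, fn))  # 0.03
```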
Clearly, an accuracy rate of 97% may not be acceptable: the classifier could be correctly labeling only the non-cancer tuples, for instance, and misclassifying all the cancer tuples. Instead, we need other measures that assess how well the classifier can recognize

the positive tuples and how well it can recognize the negative tuples.

Sensitivity and Specificity: Classification often involves a balance between being overly conservative and overly aggressive in decision making. For example, an e-mail filter could guarantee to eliminate every spam message by aggressively eliminating nearly every ham message at the same time. On the other hand, a guarantee that no ham messages will be inadvertently filtered might allow an unacceptable amount of spam to pass through the filter. This tradeoff is captured by a pair of measures: sensitivity and specificity. The sensitivity of a model (also called the true positive rate) measures the proportion of positive examples that were correctly classified. Therefore, as shown in the following formula, it is calculated as the number of true positives divided by the total number of positives in the data, that is, those correctly classified (the true positives) as well as those incorrectly classified (the false negatives).

sensitivity = TP / (TP + FN)    (4.3)

The specificity of a model (also called the true negative rate) measures the proportion of negative examples that were correctly classified. As with sensitivity, this is computed as the number of true negatives divided by the total number of negatives: the true negatives plus the false positives.

specificity = TN / (TN + FP)    (4.4)

Precision and Recall: Closely related to sensitivity and specificity are two other performance measures, related to compromises made in classification: precision and recall. Used primarily in the context of information retrieval, these statistics are intended to provide an indication of how interesting and relevant a model's results are, or whether the predictions are diluted by meaningless noise. The precision (also known as the positive predictive value) is defined as the proportion of predicted positives that are truly positive; in other words, when a model predicts the positive class, how often is it correct? A precise model will only predict the positive class in cases very likely to be positive. It will be very trustworthy.

Consider what would happen if the model were very imprecise. Over time, the results would be less likely to be trusted. In the context of information retrieval, this would be similar to a search engine such as Google returning unrelated results; eventually, users would switch to a competitor such as Bing. In the case of an SMS spam filter, high precision means that the model is able to carefully target only the spam while ignoring the ham.

precision = TP / (TP + FP)    (4.5)

On the other hand, recall is a measure of how complete the results are. As shown in the following formula, this is defined as the number of true positives over the total number of positives. We may recognize that this is the same as sensitivity; only the interpretation differs. A model with high recall captures a large portion of the positive examples, meaning that it has wide breadth. For example, a search engine with high recall returns a large number of documents pertinent to the search query. Similarly, an SMS spam filter has high recall if the majority of spam messages are correctly identified.

recall = TP / (TP + FN)    (4.6)

The F-Measure: A measure of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or the F-score). The F-measure combines precision and recall using the harmonic mean. The harmonic mean is used rather than the more common arithmetic mean since both precision and recall are expressed as proportions between zero and one. The following are the formulas for the F-measure and its weighted generalization:

F-measure = (2 × precision × recall) / (recall + precision)    (4.7)

F_β = ((1 + β²) × precision × recall) / (β² × precision + recall)    (4.8)

In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:
1. Speed: This refers to the computational costs involved in generating and using the given classifier.
2.
Robustness: This is the ability of the classifier to make a correct prediction given noisy data or data with missing values. Robustness is typically assessed

with a series of synthetic data sets representing increasing degrees of noise and missing values.
3. Scalability: This refers to the ability to construct the classifier efficiently given large amounts of data. Scalability is typically assessed with a series of data sets of increasing size.
4. Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.
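The measures in equations (4.3) to (4.7) are simple functions of the confusion-matrix counts. In this sketch the SMS spam-filter counts are invented for illustration:

```python
def sensitivity(tp, fn):           # eq. (4.3); identical formula to recall (4.6)
    return tp / (tp + fn)

def specificity(tn, fp):           # eq. (4.4): true negative rate
    return tn / (tn + fp)

def precision(tp, fp):             # eq. (4.5): positive predictive value
    return tp / (tp + fp)

def f_measure(p, r):               # eq. (4.7): harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Hypothetical SMS spam filter: 120 spam caught, 30 spam missed,
# 10 ham wrongly flagged as spam, 840 ham passed through.
tp, fn, fp, tn = 120, 30, 10, 840
p, r = precision(tp, fp), sensitivity(tp, fn)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))  # 0.923 0.8 0.857
```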

4.3 How To Estimate These Metrics?

We can use the following methods to estimate the evaluation metrics explained in depth in the preceding sections:
a. Training data
b. Independent test data
c. Hold-out method
d. k-fold cross-validation method
e. Leave-one-out method
f. Bootstrap method
g. Comparing two models

4.3.1 Training and Independent Test Data

The accuracy/error estimates on the training data are not good indicators of performance on future data, because new data will probably not be exactly the same as the training data. The accuracy/error estimates on the training data measure the degree of the classifier's over-fitting. Fig. 4.2 depicts the use of a training set.
Figure 4.2: Training Set
Estimation with independent test data (Figure 4.3) is used when we have plenty of data and there is a natural way of forming training and test data. For example, Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.
Figure 4.3: Training and Test Set

Figure 4.4: Classification: Train, Validation, Test Split (Reference [3])

4.3.2 Holdout Method

The holdout method (Fig. 4.5) is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.
Figure 4.5: Holdout Method
The hold-out method is usually used when we have thousands

of instances, including several hundred instances from each class. For unbalanced data-sets, the samples might not be representative: few or no instances of some classes will be present in the case of class-imbalanced data, where one class is in the majority, e.g. fraudulent transaction detection and medical diagnostic tests. To make the sample representative for holdout, we use the concept of stratification, in which we ensure that each class is represented in proportion to its share of the actual data-set. Random sub-sampling is a variation of the holdout method in which the holdout method is repeated k times. In each iteration, a certain proportion is randomly selected for training (possibly with stratification). The error rates on the different iterations are averaged to yield an overall error rate. It is also known as the repeated holdout method.

4.3.3 K-Fold Cross-Validation

In k-fold cross-validation (Fig. 4.6), the initial data are randomly partitioned into k mutually exclusive subsets or "folds", D_1, D_2, ..., D_k, each of approximately equal size. Training and testing is performed k times. In iteration i, partition D_i is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D_2, ..., D_k collectively serve as the training set in order to obtain a first model, which is tested on D_1; the second iteration is trained on subsets D_1, D_3, ..., D_k and tested on D_2; and so on. Unlike the holdout and random subsampling methods above, here each sample is used the same number of times for training and exactly once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.

Leave-One-Out CV
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is left out at a time for the test set.
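A minimal sketch of the k-fold procedure described above; the trivial majority-class "learner" is an illustrative stand-in for a real classifier:

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, k, train_fn, predict_fn):
    """Each tuple is tested exactly once; accuracy = total correct / n."""
    folds = kfold_indices(len(data), k)
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([data[j] for j in train], [labels[j] for j in train])
        correct += sum(predict_fn(model, data[j]) == labels[j] for j in test)
    return correct / len(data)

# Stand-in learner: always predicts the majority training label.
majority_train = lambda X, y: max(set(y), key=y.count)
majority_predict = lambda model, x: model

X = list(range(10))
y = ["a"] * 7 + ["b"] * 3
# The majority model always predicts "a", so it is right on the 7 "a" tuples.
print(cross_validate(X, y, k=5, train_fn=majority_train, predict_fn=majority_predict))
```

Setting k equal to len(data) in this sketch gives the leave-one-out special case discussed next.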
Some features of Leave-one-out CV are:
1. Makes best use of the data.
2. Involves no random sub-sampling.

Figure 4.6: k-fold cross-validation
Disadvantages of Leave-one-out CV:
1. Stratification is not possible.
2. It is very computationally expensive.

4.3.4 Bootstrap

Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set. The bootstrap instead uses sampling with replacement to form the training set:
1. Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
2. Use this data as the training set.
3. Use the instances from the original dataset that do not occur in the new training set for testing.
4. A particular instance has a probability of 1 − 1/n of not being picked in a single draw. Thus its probability of ending up in the test data is (as n tends to infinity):

(1 − 1/n)^n ≈ e^(−1) = 0.368    (4.9)

5. This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.
6. The error estimate on the test data will be very pessimistic because the classifier is trained on just ~63% of the instances.

7. Therefore, we combine it with the training error:

err = 0.632 × e_test + 0.368 × e_train    (4.10)

8. The training error thus gets less weight than the error on the test data.

4.3.5 Comparing Two Classifier Models

Suppose that we have generated two models, M1 and M2 (for either classification or prediction), from our data. We have performed 10-fold cross-validation to obtain a mean error rate for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases, and there can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant: what if any difference between the two can just be attributed to chance? The following points explain in detail how to judge whether their difference is statistically significant.
1. Assume that we have two classifiers, M_1 and M_2, and we would like to know which one is better for a classification problem.
2. We test the classifiers on n test data sets D_1, D_2, ..., D_n, and we receive error rate estimates e_11, e_12, ..., e_1n for classifier M_1 and error rate estimates e_21, e_22, ..., e_2n for classifier M_2.
3. Using these rate estimates we can compute the mean error rate ē_1 for classifier M_1 and the mean error rate ē_2 for classifier M_2.
4. These mean error rates are just estimates of error on the true population of future data cases.
5. We note that the error rate estimates e_11, ..., e_1n for classifier M_1 and e_21, ..., e_2n for classifier M_2 are paired. Thus, we consider the differences d_1, d_2, ..., d_n, where d_j = e_1j − e_2j.
6. The differences d_1, d_2, ..., d_n are instantiations of n random variables with mean μ_D and standard deviation σ_D.
7. We need to establish confidence intervals for μ_D in order to decide whether the difference in the generalization performance of the classifiers M_1 and M_2 is statistically significant or not.

8. Since the standard deviation σ_D is unknown, we approximate it using the sample standard deviation s_d:

s_d = √( (1/n) Σ_{i=1}^{n} [ (e_1i − e_2i) − (ē_1 − ē_2) ]² )    (4.11)

9. The t-statistic is

t = (d̄ − μ_D) / (s_d / √n)    (4.12)

10. The t-statistic is governed by a t-distribution with n − 1 degrees of freedom. Figure 4.7 shows a t-distribution curve (Reference [4]).
Figure 4.7: t-distribution curve (Reference [4])
11. If d̄ and s_d are the mean and standard deviation of the normally distributed differences of n random pairs of errors, a (1 − α)100% confidence interval for μ_D = μ_1 − μ_2 is:

d̄ − t_{α/2} s_d/√n < μ_D < d̄ + t_{α/2} s_d/√n    (4.13)

where t_{α/2} is the t-value with v = n − 1 degrees of freedom, leaving an area of α/2 to the right.

12. If t > z or t < −z, where z = t_{α/2}, then t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M_1 and M_2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we conclude that any difference between M_1 and M_2 can be attributed to chance.

4.4 ROC Curves

The ROC (Receiver Operating Characteristic) curve is commonly used to examine the tradeoff between the detection of true positives and the avoidance of false positives. As you might suspect from the name, ROC curves were developed by engineers in the field of communications around the time of World War II; receivers of radar and radio signals needed a method to discriminate between true signals and false alarms. The same technique is useful today for visualizing the efficacy of machine learning models. The characteristics of a typical ROC diagram are depicted in Figure 4.8 (Reference [2]). Curves are defined on a plot with the proportion of true positives on the vertical axis and the proportion of false positives on the horizontal axis. Because these values are equivalent to sensitivity and (1 − specificity), respectively, the diagram is also known as a sensitivity/specificity plot.
Figure 4.8: ROC curves (Reference [2])

The points comprising ROC curves indicate the true positive rate at varying false positive thresholds. To create the curves, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first. Beginning at the origin, each prediction's impact on the true positive rate and false positive rate results in a curve tracing vertically (for a correct prediction) or horizontally (for an incorrect prediction). To illustrate this concept, three hypothetical classifiers are contrasted in the previous plot. First, the diagonal line from the bottom-left to the top-right corner of the diagram represents a classifier with no predictive value. This type of classifier detects true positives and false positives at exactly the same rate, implying that the classifier cannot discriminate between the two. This is the baseline by which other classifiers may be judged; ROC curves falling close to this line indicate models that are not very useful. Similarly, the perfect classifier has a curve that passes through the point at a 100 percent true positive rate and a 0 percent false positive rate: it is able to correctly identify all of the true positives before it incorrectly classifies any negative result. Most real-world classifiers are similar to the test classifier; they fall somewhere in the zone between perfect and useless. The closer the curve is to the perfect classifier, the better it is at identifying positive values. This can be measured using a statistic known as the area under the ROC curve (abbreviated AUC). The AUC, as you might expect, treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve. AUC ranges from 0.5 (for a classifier with no predictive value) to 1.0 (for a perfect classifier). A convention for interpreting AUC scores uses a system similar to academic letter grades:
1. 0.9 to 1.0
= A (outstanding)
2. 0.8 to 0.9 = B (excellent/good)
3. 0.7 to 0.8 = C (acceptable/fair)
4. 0.6 to 0.7 = D (poor)
5. 0.5 to 0.6 = F (no discrimination)
As with most scales similar to this, the levels may work better for some tasks than others; the categorization is somewhat subjective.
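The curve-tracing procedure described above (sort by predicted probability, step up for each positive, step right for each negative) also yields the AUC directly; this is a minimal sketch in which the prediction scores are invented:

```python
def roc_points(scores, labels):
    """Trace the ROC curve: sort by score descending; each positive moves
    the curve up (TPR), each negative moves it right (FPR)."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the traced curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # model's positive-class probabilities
labels = [1,   1,   0,   1,   0,   0]     # 1 = positive class
print(auc(roc_points(scores, labels)))    # 8/9: one of nine pos-neg pairs misranked
```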

4.5 Ensemble Methods

Motivation
1. An ensemble model improves accuracy and robustness over single-model methods.
2. Applications:
(a) distributed computing
(b) privacy-preserving applications
(c) large-scale data with reusable models
(d) multiple sources of data
3. Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (a divide-and-conquer approach).

4.5.1 Why Ensembles Work
1. Intuition: combining diverse, independent opinions in human decision-making acts as a protective mechanism (e.g. a stock portfolio).
2. Overcoming the limitations of a single hypothesis: the target function may not be implementable with individual classifiers, but may be approximated by model averaging.
3. Gives a global picture.
Figure 4.9: Ensemble Gives Global Picture

4.5.2 Ensembles Work in Two Ways
1. Learn to Combine
Figure 4.10: Learn to Combine (Reference [3])
2. Learn by Consensus
Figure 4.11: Learn by Consensus (Reference [3])

4.5.3 Learn to Combine
Pros
1. Gets useful feedback from labeled data.
2. Can potentially improve accuracy.
Cons
1. Needs to keep the labeled data to train the ensemble.
2. May overfit the labeled data.
3. Cannot work when no labels are available.

4.5.4 Learn by Consensus
Pros
1. Does not need labeled data.
2. Can improve the generalization performance.
Cons
1. No feedback from the labeled data.
2. Requires the assumption that consensus is better.

4.5.5 Bagging

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, D_i, of d tuples is sampled with replacement from the original set of tuples, D. The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in D_i, whereas others may occur more than once. A classifier model, M_i, is learned for each training set, D_i. To classify an unknown tuple, X, each classifier, M_i, returns its class prediction, which counts as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces

the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.

Algorithm
The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally-weighted prediction.
Input: D, a set of d training tuples; k, the number of models in the ensemble; a learning scheme (e.g., decision tree algorithm, back-propagation, etc.)
Output: A composite model, M*.
Method:
(1) for i = 1 to k do // create k models:
(2) create a bootstrap sample, D_i, by sampling D with replacement;
(3) use D_i to derive a model, M_i;
(4) end for
To use the composite model on a tuple, X:
(1) if classification then
(2) let each of the k models classify X and return the majority vote;
(3) if prediction then
(4) let each of the k models predict a value for X and return the average predicted value;

4.5.6 Boosting

Principles
1. Boosting turns a set of weak learners into a strong learner.
2. It is an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
3. Initially, all N records are assigned equal weights; unlike in bagging, the weights may change at the end of each boosting round.
4. Records that are wrongly classified will have their weights increased.
5. Records that are classified correctly will have their weights decreased.
6. Equal weights are assigned to each training tuple (1/d for round 1).
7. After classifier M_i is learned, the weights are adjusted to allow the subsequent classifier, M_{i+1}, to pay more attention to tuples that were misclassified by M_i.

8. The final boosted classifier, M*, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.
9. AdaBoost is a popular boosting algorithm.

AdaBoost Boosting Algorithm
Input:
1) A training set D containing d tuples
2) k rounds
3) A classification learning scheme
Output: A composite model
Method:
1. Data set D contains d class-labeled tuples (X_1, y_1), (X_2, y_2), (X_3, y_3), ..., (X_d, y_d).
2. Initially, assign an equal weight of 1/d to each tuple.
3. To generate k base classifiers, we need k rounds or iterations.
4. In round i, tuples from D are sampled with replacement to form D_i (of size d).
5. Each tuple's chance of being selected depends on its weight.
6. A base classifier, M_i, is derived from the training tuples of D_i.
7. The error of M_i is tested using D_i, and the weights of the training tuples are adjusted depending on how they were classified: correctly classified, decrease the weight; incorrectly classified, increase the weight.
8. The weight of a tuple indicates how hard it is to classify (directly proportional).
9. Some classifiers may be better at classifying some hard tuples than others.
10. We finally have a series of classifiers that complement each other.
11. Error estimate:

error(M_i) = Σ_j w_j × err(X_j)    (4.14)

where err(X_j) is the misclassification error for X_j (1 if misclassified, 0 otherwise).
12. If the classifier error exceeds 0.5, we abandon it.
13. We try again with a new D_i and a new M_i derived from it; error(M_i) affects how the weights of the training tuples are updated.

14. If a tuple is correctly classified in round i, its weight is multiplied by

error(M_i) / (1 − error(M_i))    (4.15)

15. Adjust the weights of all correctly classified tuples in this way.
16. The weights of all tuples (including the misclassified tuples) are then normalized, using the normalization factor

(sum of old weights) / (sum of new weights)    (4.16)

17. The weight of classifier M_i's vote is

log( (1 − error(M_i)) / error(M_i) )

18. The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be.
19. For each class c, sum the weights of each classifier that assigned class c to X (an unseen tuple).
20. The class with the highest sum is the winner!
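The weight-update bookkeeping of steps 11 to 17 can be sketched as follows; the weak learner itself is omitted, and the round's misclassification pattern is an invented input used only to show the arithmetic:

```python
import math

def adaboost_round(weights, misclassified):
    """One AdaBoost round: compute the error (eq. 4.14), shrink the weights of
    correctly classified tuples (eq. 4.15), renormalize (eq. 4.16), and return
    the new weights together with the classifier's vote weight."""
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    if error > 0.5:                        # step 12: abandon this classifier
        return weights, None
    factor = error / (1 - error)           # eq. (4.15)
    new_w = [w * factor if not miss else w
             for w, miss in zip(weights, misclassified)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]     # normalize so the weights sum to 1
    vote = math.log((1 - error) / error)   # classifier's vote weight (step 17)
    return new_w, vote

d = 5
weights = [1 / d] * d                      # step 2: equal initial weights
# Hypothetical round 1 outcome: only tuple 3 is misclassified.
weights, vote = adaboost_round(weights, [False, False, False, True, False])
print([round(w, 3) for w in weights], round(vote, 3))
```

Note how the misclassified tuple's relative weight grows (from 0.2 to 0.5 here), so the next round's sample concentrates on the hard tuple, exactly as steps 4 and 5 require.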

Chapter 5
Conclusion and Future Scope

5.1 Comparative Study

To practically explore the theoretical aspects of the data mining models and the techniques to evaluate them, we conducted a small-scale exploratory study in the data mining tool Weka, developed by the University of Waikato, New Zealand. The following tables summarize the results of our exploratory study.
Figure 5.1: Weka Screen Shots