Data Mining Models and Evaluation Techniques

Data Mining Models and Evaluation Techniques

Shubham Pachori
12BCE55

DEPARTMENT OF COMPUTER ENGINEERING
AHMEDABAD
November 2014

Data Mining Models and Evaluation Techniques

Seminar submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering

Shubham Pachori
12BCE55

DEPARTMENT OF COMPUTER ENGINEERING
AHMEDABAD
November 2014

CERTIFICATE

This is to certify that the seminar entitled "Data Mining Models and Evaluation Techniques" submitted by Shubham Pachori (12BCE55), towards the partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering of Nirma University, Ahmedabad, is the record of work carried out by him under my supervision and guidance. In my opinion, the submitted work has reached a level required for being accepted for examination. The results embodied in this seminar, to the best of my knowledge, haven't been submitted to any other university or institution for award of any degree or diploma.

Prof. K. P. Agrawal, Associate Professor, CSE Department, Institute of Technology, Nirma University, Ahmedabad.
Prof. Anuja Nair, Assistant Professor, CSE Department, Institute of Technology, Nirma University, Ahmedabad.
Dr. Sanjay Garg, Professor & Head of Department, CSE Department, Institute of Technology, Nirma University, Ahmedabad.

CSE Department, Institute of Technology, Nirma University

Acknowledgements

I am profoundly grateful to Prof. K. P. Agrawal for his expert guidance throughout the project. His continuous encouragement has fetched us the golden results. His elixir of knowledge in the field has made this project achieve its zenith and credibility. I would like to express my deepest appreciation towards Prof. Sanjay Garg, Head of the Department of Computer Engineering, and Prof. Anuja Nair, whose invaluable guidance supported us in completing this project. At last, I must express my sincere heartfelt gratitude to all the staff members of the Computer Engineering Department who helped me directly or indirectly during this course of work.

SHUBHAM PACHORI
12BCE55

Abstract

Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Classification models predict categorical (discrete, unordered) labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. As predictions always have an implicit cost involved, it is important to evaluate a classifier's generalization performance in order to determine whether to employ the classifier (for example, when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers) and to optimize the classifier (for example, when post-pruning decision trees we must evaluate the accuracy of the decision trees at each pruning step). This seminar report gives an in-depth explanation of classifier models (viz. Naive Bayesian and Decision Trees) and how these classifier models are evaluated for their accuracy of prediction. The later part of the report also deals with how to improve the accuracy of these classifier models, and it includes an exploratory study, carried out in Weka (a GUI-based data mining tool) on representative data sets, comparing the various model evaluation techniques.

Contents

Certificate
Acknowledgements
Abstract
1 Introduction
2 Classification Using Decision Trees
  2.1 Understanding Decision Trees
  2.2 Divide and Conquer
  2.3 C5.0 Decision Tree Algorithm
  2.4 How To Choose The Best Split?
  2.5 Pruning The Decision Tree
3 Probabilistic Learning - Naive Bayesian Classification
  3.1 Understanding Naive Bayesian Classification
  3.2 Bayes' Theorem
  3.3 The Naive Bayes Algorithm
  3.4 Naive Bayesian Classification
4 Model Evaluation Techniques
  4.1 Prediction Accuracy
  4.2 Confusion Matrix and Model Evaluation Metrics
  How To Estimate These Metrics?
    Training and Independent Test Data
    Holdout Method
    K-Cross-Validation
    Bootstrap
  Comparing Two Classifier Models
  ROC Curves
Ensemble Methods
  Why Ensemble Works?
  Ensemble Works in Two Ways
    Learn To Combine
    Learn By Consensus
  Bagging
  Boosting
Conclusion and Future Scope
  Comparative Study
  Conclusion
  Future Scope
References

Chapter 1
Introduction

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the high-level application of particular data mining methods. It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization. The unifying goal of the KDD process is to extract knowledge from data in the context of large databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to the specifications of measures and thresholds, using a database along with any required preprocessing, subsampling, and transformations of that database. The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:

Figure 1.1: KDD Process

1. Developing an understanding of
   1. the application domain

   2. the relevant prior knowledge
   3. the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
   1. Removal of noise or outliers.
   2. Strategies for handling missing data fields.
   3. Accounting for time sequence information and known changes.
4. Data reduction and projection.
   1. Finding useful features to represent the data depending on the goal of the task.
   2. Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task: deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm.
   1. Selecting method(s) to be used for searching for patterns in the data.
   2. Deciding which models and parameters may be appropriate.
   3. Matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining: searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.

In the following chapters we will be exploring data mining models and evaluation techniques in depth.

Chapter 2
Classification Using Decision Trees

This chapter introduces the most widely used learning methods, which apply a similar strategy of dividing data into smaller and smaller portions to identify patterns that can be used for prediction. The knowledge is then presented in the form of logical structures that can be understood without any statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement.

1. Understanding Decision Trees
2. Divide and Conquer
3. C5.0 Decision Tree Algorithm
4. Choosing The Best Split
5. Pruning The Decision Tree

2.1 Understanding Decision Trees

As we might intuit from the name itself, decision tree learners build a model in the form of a tree structure. The model itself comprises a series of logical decisions, similar to a flowchart, with decision nodes that indicate a decision to be made on an attribute. These split into branches that indicate the decision's choices. The tree is terminated by leaf nodes (also known as terminal nodes) that denote the result of following a combination of decisions. Data that is to be classified begins at the root node, where it is passed through the various decisions in the tree according to the values of its features. The path that the data takes funnels each record into a leaf node, which assigns it a predicted

class. As the decision tree is essentially a flowchart, it is particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons or in which the results need to be shared in order to facilitate decision making. Some potential uses include:

1. Credit scoring models in which the criteria that cause an applicant to be rejected need to be well-specified
2. Marketing studies of customer churn or customer satisfaction that will be shared with management or advertising agencies
3. Diagnosis of medical conditions based on laboratory measurements, symptoms, or rate of disease progression

In spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels, or where the data has a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree.

2.2 Divide and Conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is generally known as divide and conquer because it uses the feature values to split the data into smaller and smaller subsets of similar classes. Beginning at the root node, which represents the entire dataset, the algorithm chooses the feature that is the most predictive of the target class. The examples are then partitioned into groups of distinct values of this feature; this decision forms the first set of tree branches. The algorithm continues to divide and conquer the nodes, choosing the best candidate feature each time until a stopping criterion is reached. This might occur at a node if:

1. All (or nearly all) of the examples at the node have the same class
2. There are no remaining features to distinguish among examples
3. The tree has grown to a predefined size limit

To illustrate the tree building process, let's consider a simple example. Imagine that we are working for a Hollywood film studio, and our desk is piled high with

screenplays. Rather than read each one cover-to-cover, we decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: mainstream hit, critic's choice, or box office bust. To gather data for the model, we turn to the studio archives to examine the previous ten years of movie releases. After reviewing the data for 30 different movie scripts, a pattern emerges. There seems to be a relationship between the film's proposed shooting budget, the number of A-list celebrities lined up for starring roles, and the categories of success. A scatter plot of this data might look something like Figure 2.1 (Reference [2]).

Figure 2.1: Scatter Plot of Budget vs A-List Celebrities (Ref [2])

To build a simple decision tree using this data, we can apply a divide-and-conquer strategy. Let's first split on the feature indicating the number of celebrities, partitioning the movies into groups with and without a low number of A-list stars (Figure 2.2, Reference [2]).

Figure 2.2: Split 1: Scatter Plot of Budget vs A-List Celebrities (Ref [2])

Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget (Figure 2.3). At this point we have partitioned the data into three groups. The group at the top-left corner of the diagram is composed entirely of critically acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, the majority of movies are box office hits, with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops.

Figure 2.3: Split 2: Scatter Plot of Budget vs A-List Celebrities (Ref [2])

If we wanted, we could continue to divide the data by splitting it based on increasingly specific ranges of budget and celebrity counts until each of the incorrectly classified values resides in its own, perhaps tiny, partition. Since the data can continue to be split until there are no distinguishing features within a partition, a decision tree can be prone to overfitting the training data with overly specific decisions. We'll avoid this by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. Our model for predicting the future success of movies can be represented as a simple tree, as shown in Figure 2.4 (Ref [2]). To evaluate a script, follow the branches through each decision until its success or failure has been predicted. In no time, you will be able to classify the backlog of scripts and get back to more important work such as writing an awards acceptance speech. Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section we will throw some light on a popular algorithm for building decision tree models automatically.
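The finished tree amounts to a pair of nested rules, which we can sketch directly in Python. This is only an illustration of how the two splits compose: the split points (five A-list celebrities, a budget of 30 million) are invented values, not thresholds learned from the studio data.

```python
def predict_success(celebrities, budget_millions):
    """Nested-rule form of the movie decision tree; the thresholds
    (5 celebrities, a 30-million budget) are made-up illustration values."""
    if celebrities < 5:
        return "box office bust"   # little star power -> flop
    elif budget_millions >= 30:
        return "mainstream hit"    # many stars, high budget
    else:
        return "critic's choice"   # many stars, modest budget

print(predict_success(celebrities=2, budget_millions=80))   # box office bust
print(predict_success(celebrities=9, budget_millions=100))  # mainstream hit
print(predict_success(celebrities=8, budget_millions=10))   # critic's choice
```

Each call simply follows the branches from the root to a leaf, exactly as a reader would trace the flowchart by hand.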

Figure 2.4: Decision Tree Model (Reference [2])

2.3 C5.0 Decision Tree Algorithm

There are numerous implementations of decision trees, but one of the most well known is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his ID3 (Iterative Dichotomiser 3) algorithm.

Strengths of the C5.0 Algorithm:
1. An all-purpose classifier that does well on most problems
2. A highly automatic learning process that can handle numeric or nominal features, as well as missing data
3. Uses only the most important features
4. Can be used on data with relatively few training examples or a very large number
5. Results in a model that can be interpreted without a mathematical background (for relatively small trees)
6. More efficient than other complex models

Weaknesses of the C5.0 Algorithm:

1. Decision tree models are often biased toward splits on features having a large number of levels
2. It is easy to overfit or underfit the model
3. Can have trouble modeling some relationships due to reliance on axis-parallel splits
4. Small changes in training data can result in large changes to decision logic
5. Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

2.4 How To Choose The Best Split?

The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for feature values that split the data in such a way that partitions contained examples primarily of a single class. If the segments of data contain only a single class, they are considered pure. There are many different measurements of purity for identifying splitting criteria; C5.0 uses entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of 0 indicates that the sample is completely homogeneous, while 1 indicates the maximum amount of disorder (for a two-class sample). The definition of entropy is:

Entropy(S) = − Σ (i=1..c) p_i log2(p_i)   (2.1)

In the entropy formula, for a given segment of data (S), the term c refers to the number of different class levels, and p_i refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as:

Entropy(S) = −0.6 log2(0.6) − 0.4 log2(0.4) = 0.971   (2.2)

Given this measure of purity, the algorithm must still decide which feature to split upon. For this, the algorithm uses entropy to calculate the change in homogeneity resulting from a split on each possible feature. The calculation is known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting

from the split (S2):

InfoGain(F) = Entropy(S1) − Entropy(S2)   (2.3)

The one complication is that, after a split, the data is divided into more than one partition. Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighing each partition's entropy by the proportion of records falling into that partition. This can be stated in a formula as:

Entropy(S2) = Σ (i=1..n) w_i Entropy(P_i)   (2.4)

In simple terms, the total entropy resulting from a split is the sum of the entropies of each of the n partitions P_i, weighted by the proportion of examples falling in that partition, w_i. The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply that the entropy after the split is zero, which means that the split results in completely homogeneous groups.

The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. A common practice is testing various splits that divide the values into groups greater than or less than a threshold. This reduces the numeric feature into a two-level categorical feature, and information gain can be calculated easily. The numeric threshold yielding the largest information gain is chosen for the split.

2.5 Pruning The Decision Tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will have been overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data.

One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or if the decision nodes contain only a small number of examples. This is called early stopping, or pre-pruning, the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside is that there is no way to know whether the tree will miss subtle but important patterns that it would have learned had it grown to a larger size.

An alternative, called post-pruning, involves growing a tree that is too large, then using pruning criteria based on the error rates at the nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered.

One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it takes care of many of the decisions automatically, using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, nodes and branches that have little effect on the classification errors are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively.

Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if it improves performance on the test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options.
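Before leaving this chapter, the split-selection arithmetic of Section 2.4 (Equations 2.1 through 2.4) is compact enough to sketch in pure Python. The helper names entropy and info_gain are ours, not part of any C5.0 implementation:

```python
import math

def entropy(proportions):
    """Entropy of a data segment from its class proportions p_i (Eq. 2.1)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def info_gain(parent_proportions, partitions):
    """Information gain of a split (Eq. 2.3). `partitions` is a list of
    (w_i, proportions) pairs: the fraction of records landing in each
    partition and that partition's own class proportions (Eq. 2.4)."""
    entropy_after = sum(w * entropy(props) for w, props in partitions)
    return entropy(parent_proportions) - entropy_after

# The red/white example of Eq. 2.2: 60 percent red, 40 percent white.
print(round(entropy([0.6, 0.4]), 3))  # 0.971

# A split yielding two perfectly pure partitions removes all the entropy,
# so its information gain equals the parent segment's entropy.
gain = info_gain([0.6, 0.4], [(0.6, [1.0]), (0.4, [1.0])])
print(round(gain, 3))  # 0.971
```

Running this reproduces the 0.971 bits of Equation 2.2, and shows the maximum-gain case discussed above: a split into pure partitions recovers exactly the pre-split entropy.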

Chapter 3
Probabilistic Learning - Naive Bayesian Classification

When a meteorologist provides a weather forecast, precipitation is typically predicted using terms such as "70 percent chance of rain." These forecasts are known as probability of precipitation reports. Have you ever considered how they are calculated? It is a puzzling question, because in reality, it either will rain or it will not. This chapter covers a machine learning algorithm called naive Bayes, which also uses principles of probability for classification. Just as meteorologists forecast the weather, naive Bayes uses data about prior events to estimate the probability of future events. For instance, a common application of naive Bayes uses the frequency of words in past junk email messages to identify new junk mail.

3.1 Understanding Naive Bayesian Classification

The basic statistical ideas necessary to understand the naive Bayes algorithm have been around for centuries. The technique descended from the work of the 18th-century mathematician Thomas Bayes, who developed foundational mathematical principles (now known as Bayesian methods) for describing the probability of events, and how probabilities should be revised in light of additional information. Classifiers based on Bayesian methods utilize training data to calculate an observed probability of each class based on feature values. When the classifier is later used on unlabeled data, it uses the observed probabilities to predict the most likely class for the new features. It's a simple idea, but it results in a method that often has results on par with more sophisticated algorithms. In fact, Bayesian classifiers have been used for:

1. Text classification, such as junk email (spam) filtering, author identification, or topic categorization

2. Intrusion detection or anomaly detection in computer networks
3. Diagnosing medical conditions, when given a set of observed symptoms

Typically, Bayesian classifiers are best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the probability of an outcome. While many algorithms ignore features that have weak effects, Bayesian methods utilize all available evidence to subtly change the predictions. If a large number of features have relatively minor effects, taken together their combined impact could be quite large.

3.2 Bayes' Theorem

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered evidence. As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the evidence or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of Rs 40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
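To make these quantities concrete, here is a small numeric sketch of the computer-purchase example. Every probability below is an invented illustration value, and the formula applied is simply Bayes' theorem as stated in the following paragraphs:

```python
# Hypothesis H: the customer buys a computer.
# Evidence  X: the customer is 35 years old with an income of Rs 40,000.
# All three inputs are made-up illustration values.
p_h = 0.5          # prior P(H): fraction of all customers who buy a computer
p_x_given_h = 0.3  # likelihood P(X|H): fraction of buyers who match X
p_x = 0.2          # evidence P(X): fraction of all customers who match X

# Bayes' theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.75
```

In words: knowing that this customer matches the relatively rare evidence X (only 20 percent of customers do) raises our belief that they will buy a computer from the 50 percent prior to a 75 percent posterior.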
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns Rs 40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns Rs 40,000. How are these probabilities estimated? P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes' theorem is useful in that it provides a way of calculating the

posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes' theorem is:

P(H|X) = P(X|H) P(H) / P(X)   (3.1)

3.3 The Naive Bayes Algorithm

The naive Bayes (NB) algorithm describes a simple application of Bayes' theorem for classification. Although it is not the only machine learning method utilizing Bayesian methods, it is the most common, particularly for text classification, where it has become the de facto standard. The strengths and weaknesses of this algorithm are as follows.

Strengths:
1. Simple, fast, and very effective
2. Does well with noisy and missing data
3. Requires relatively few examples for training, but also works well with very large numbers of examples
4. Easy to obtain the estimated probability for a prediction

Weaknesses:
1. Relies on an often-faulty assumption of equally important and independent features
2. Not ideal for datasets with large numbers of numeric features
3. Estimated probabilities are less reliable than the predicted classes

The naive Bayes algorithm is named as such because it makes a couple of naive assumptions about the data. In particular, naive Bayes assumes that all of the features in the dataset are equally important and independent. These assumptions are rarely true in most real-world applications. For example, if you were attempting to identify spam by monitoring email messages, it is almost certainly true that some features will be more important than others. For example, the sender of the email may be a more important indicator of spam than the message text. Additionally, the words that appear in the message

body are not independent from one another, since the appearance of some words is a very good indication that other words are also likely to appear. A message with the word Viagra is probably likely to also contain the words prescription or drugs. However, in most cases when these assumptions are violated, naive Bayes still performs fairly well. This is true even in extreme circumstances where strong dependencies are found among the features. Due to the algorithm's versatility and accuracy across many types of conditions, naive Bayes is often a strong first candidate for classification learning tasks.

3.4 Naive Bayesian Classification

The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively A1, A2, ..., An.

2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i   (3.2)

Thus we maximize P(Ci|X). The class for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)   (3.3)

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.

4.
Given data sets with many attributes, it would be extremely computationally

expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

P(X|Ci) = Π (k=1..n) P(xk|Ci)   (3.4)

5. In order to predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i   (3.5)

In other words, the predicted class label is the class Ci for which P(X|Ci) P(Ci) is the maximum.
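Steps 1 through 5 can be condensed into a short, bare-bones Python sketch. The function names and the toy training tuples below are ours, and the zero-frequency problem (a single P(xk|Ci) of zero wiping out the whole product, usually fixed with Laplace smoothing) is deliberately ignored to keep the counting logic visible:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(Ci) and P(xk|Ci) by simple counting (steps 1-4).
    `examples` is a list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    # attr_counts[label][k][value]: how many class-`label` tuples have
    # their k-th attribute equal to `value`
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for k, value in enumerate(attrs):
            attr_counts[label][k][value] += 1
    return class_counts, attr_counts

def predict(class_counts, attr_counts, attrs):
    """Return the class Ci maximizing P(X|Ci) P(Ci) (step 5, Eq. 3.5)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                # prior P(Ci)
        for k, value in enumerate(attrs):    # product of P(xk|Ci), Eq. 3.4
            score *= attr_counts[label][k][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set: (age_group, income_level) -> buys_computer
data = [(("youth", "high"), "no"),   (("youth", "low"), "no"),
        (("middle", "high"), "yes"), (("senior", "low"), "yes"),
        (("middle", "low"), "yes"),  (("senior", "high"), "no")]
model = train_naive_bayes(data)
print(predict(*model, ("middle", "low")))  # yes
```

For the query ("middle", "low"), the yes class wins with P(X|yes) P(yes) = 1/2 x 2/3 x 2/3, roughly 0.22, while the no class scores zero because no no-labeled tuple has the age group "middle".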

Chapter 4
Model Evaluation Techniques

Now that we have explored in depth the two most widely used classifier models, the question we face is how accurately these classifiers can predict future trends based on the data used to build them, viz. how accurately the customer recommender system of a company can predict the future purchasing behavior of a customer based on the previously recorded sales data of its customers. Given the significant role these classifiers play, their accuracy becomes of prime importance to companies, especially those in e-commerce. Thus model evaluation techniques are employed to evaluate the accuracy of the predictions made by a classifier model. As different classifier models have varying strengths and weaknesses, it is necessary to use tests that reveal distinctions among the learners when measuring how a model will perform on future data. The succeeding sections in this chapter will primarily focus on the following points:

1. The reason why predictive accuracy is not sufficient to measure performance, and what the other alternatives for measuring accuracy are
2. Methods to ensure that the performance measures reasonably reflect a model's ability to predict or forecast unseen data

4.1 Prediction Accuracy

The prediction accuracy of a classifier model is defined as the proportion of correct predictions out of the total number of predictions. This number indicates the percentage of cases in which the learner is right or wrong. For instance, suppose a classifier correctly identified whether or not 99,990 out of 100,000 newborn babies are carriers of a treatable but potentially fatal genetic defect. This would imply an accuracy of 99.99 percent and an error rate of only 0.01 percent.

Although this would appear to indicate an extremely accurate classifier, it would be wise to collect additional information before trusting your child's life to the test. What if the genetic defect is found in only 10 out of every 100,000 babies? A test that predicts no defect regardless of circumstances will still be correct for 99.99 percent of all cases. In this case, even though the predictions are correct for the large majority of the data, the classifier is not very useful for its intended purpose, which is to identify children with birth defects. The best measure of classifier performance is whether the classifier is successful at its intended purpose. For this reason, it is crucial to have measures of model performance that measure utility rather than raw accuracy.

4.2 Confusion Matrix and Model Evaluation Metrics

A confusion matrix is a matrix that categorizes predictions according to whether they match the actual value in the data. One of the table's dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values. It can be an n-by-n matrix, depending on the number of values the predicted class can take. Figure 4.1 (Reference [2]) depicts a 2x2 and a 3x3 confusion matrix.

Figure 4.1: Confusion Matrix (Ref [2])

There are four important terms that are considered the building blocks used in computing many evaluation measures. The class of interest is known as the positive class, while all others are known as negative.

1. True Positives (TP): correctly classified as the class of interest.
2. True Negatives (TN): correctly classified as not the class of interest.
3. False Positives (FP): incorrectly classified as the class of interest.
4. False Negatives (FN): incorrectly classified as not the class of interest.
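The genetic-defect scenario from Section 4.1 can be replayed with these four counts. A minimal sketch follows; the helper confusion_counts is our own name, and accuracy and error rate are computed as defined by Equations 4.1 and 4.2:

```python
def confusion_counts(actual, predicted, positive):
    """Tally TP, FP, FN, TN, treating `positive` as the class of interest."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

# 100,000 newborns, 10 of whom carry the defect; the classifier
# naively predicts "no defect" every single time.
actual = ["defect"] * 10 + ["no defect"] * 99_990
predicted = ["no defect"] * 100_000

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="defect")
accuracy = (tp + tn) / len(actual)    # Eq. 4.1: (TP + TN) / (P + N)
error_rate = (fp + fn) / len(actual)  # Eq. 4.2: (FP + FN) / (P + N)
print(accuracy, error_rate, tp)  # 0.9999 0.0001 0
```

Despite the 99.99 percent accuracy, TP is zero: the classifier never identifies a single carrier, which is exactly the utility failure described above.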

The confusion matrix is a useful tool for analysing how well a classifier can recognize tuples of different classes. TP and TN tell us when the classifier is getting things right, while FP and FN tell us when it is getting things wrong. Given m classes, a confusion matrix is a matrix of at least m x m size. An entry CM_i,j in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM_1,1 to entry CM_m,m, with the rest of the entries being zero or close to zero; that is, ideally FP and FN are around zero.

Accuracy: The accuracy of a classifier on a given test set is the percentage of test tuples that are correctly classified by the classifier.

accuracy = (TP + TN) / (P + N)    (4.1)

Error Rate: The error rate, or misclassification rate, of a classifier M is simply 1 - accuracy(M), where accuracy(M) is the accuracy of M.

error rate = (FP + FN) / (P + N)    (4.2)

If we use the training set instead of a test set to estimate the error rate of a model, this quantity is known as the resubstitution error. This error estimate is optimistic of the true error rate because the model is not tested on any samples that it has not already seen.

The Class Imbalance Problem: In some datasets the main class of interest is rare; that is, the dataset distribution reflects a significant majority of the negative class and a minority positive class. For example, in fraud detection applications the class of interest, the fraudulent class, is rare or less frequently occurring in comparison to the negative, non-fraudulent class. In medical data there may be a rare class, such as cancer. Suppose that we have trained a classifier to classify medical data tuples, where the class label attribute is cancer and the possible class values are yes and no. An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only, say, 3% of the training tuples are actually cancer? Clearly an accuracy rate of 97% may not be acceptable: the classifier could be correctly labeling only the non-cancer tuples, for instance, and misclassifying all the cancer tuples. Instead, we need other measures, which assess how well the classifier can recognize

the positive tuples and how well it can recognize the negative tuples.

Sensitivity and Specificity: Classification often involves a balance between being overly conservative and overly aggressive in decision making. For example, an e-mail filter could guarantee to eliminate every spam message by aggressively eliminating nearly every ham message at the same time. On the other hand, a guarantee that no ham messages will be inadvertently filtered might allow an unacceptable amount of spam to pass through the filter. This tradeoff is captured by a pair of measures: sensitivity and specificity.

The sensitivity of a model (also called the true positive rate) measures the proportion of positive examples that were correctly classified. As shown in the following formula, it is calculated as the number of true positives divided by the total number of positives in the data: those correctly classified (the true positives) as well as those incorrectly classified (the false negatives).

sensitivity = TP / (TP + FN)    (4.3)

The specificity of a model (also called the true negative rate) measures the proportion of negative examples that were correctly classified. As with sensitivity, this is computed as the number of true negatives divided by the total number of negatives: the true negatives plus the false positives.

specificity = TN / (TN + FP)    (4.4)

Precision and Recall: Closely related to sensitivity and specificity are two other performance measures concerned with compromises made in classification: precision and recall. Used primarily in the context of information retrieval, these statistics are intended to indicate how interesting and relevant a model's results are, or whether the predictions are diluted by meaningless noise. The precision (also known as the positive predictive value) is defined as the proportion of positive examples that are truly positive; in other words, when a model predicts the positive class, how often is it correct? A precise model will only predict the positive class in cases very likely to be positive, and will therefore be very trustworthy.

Consider what would happen if the model were very imprecise. Over time, the results would be less likely to be trusted. In the context of information retrieval, this would be similar to a search engine such as Google returning unrelated results; eventually users would switch to a competitor such as Bing. In the case of an SMS spam filter, high precision means that the model is able to carefully target only the spam while ignoring the ham.

precision = TP / (TP + FP)    (4.5)

On the other hand, recall is a measure of how complete the results are. As shown in the following formula, it is defined as the number of true positives over the total number of positives. We may recognize that this is the same as sensitivity; only the interpretation differs. A model with high recall captures a large portion of the positive examples, meaning that it has wide breadth. For example, a search engine with high recall returns a large number of documents pertinent to the search query. Similarly, an SMS spam filter has high recall if the majority of spam messages are correctly identified.

recall = TP / (TP + FN)    (4.6)

The F-Measure: A measure of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or the F-score). The F-measure combines precision and recall using the harmonic mean. The harmonic mean is used rather than the more common arithmetic mean because both precision and recall are expressed as proportions between zero and one. The formulas for the F-measure and its generalization F_beta are:

F-Measure = (2 x precision x recall) / (recall + precision)    (4.7)

F_beta = ((1 + beta^2) x precision x recall) / (beta^2 x precision + recall)    (4.8)

In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

1. Speed: the computational costs involved in generating and using the given classifier.
2. Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values. Robustness is typically assessed with a series of synthetic datasets representing increasing degrees of noise and missing values.
3. Scalability: the ability to construct the classifier efficiently given large amounts of data. Scalability is typically assessed with a series of datasets of increasing size.
4. Interpretability: the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.
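The measures in equations (4.1) through (4.8) reduce to a few lines of arithmetic on the TP/TN/FP/FN counts. A minimal sketch, assuming the counts are already tallied (the function names are illustrative):

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)          # eq. (4.1)

def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)          # eq. (4.2)

def sensitivity(tp, fn):                            # recall / true positive rate
    return tp / (tp + fn)                           # eq. (4.3)

def specificity(tn, fp):                            # true negative rate
    return tn / (tn + fp)                           # eq. (4.4)

def precision(tp, fp):                              # positive predictive value
    return tp / (tp + fp)                           # eq. (4.5)

def f_beta(p, r, beta=1.0):                         # eq. (4.8); beta=1 gives eq. (4.7)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Cancer example from the text: 97% accuracy can hide zero sensitivity.
tp, tn, fp, fn = 0, 97, 0, 3    # classifier that labels everything "no cancer"
print(accuracy(tp, tn, fp, fn))     # 0.97
print(sensitivity(tp, fn))          # 0.0 -> useless for the positive class
```

The last two lines make the class-imbalance point concrete: the accuracy looks excellent while the sensitivity reveals that no positive tuple is ever found.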

4.3 How to Estimate These Metrics?

We can use the following methods to estimate the evaluation metrics explained in depth in the preceding sections:

a. Training data
b. Independent test data
c. Hold-out method
d. k-fold cross-validation method
e. Leave-one-out method
f. Bootstrap method
g. Comparing two models

Training and Independent Test Data

Accuracy and error estimates on the training data are not good indicators of performance on future data, because new data will probably not be exactly the same as the training data; accuracy/error estimates on the training data measure the degree of the classifier's over-fitting. Figure 4.2 depicts the use of a training set. Estimation with independent test data (Figure 4.3) is used when we have plenty of data and there is a natural way of forming training and test data. For example, Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

Figure 4.2: Training Set

Figure 4.3: Training and Test Set

Figure 4.4: Classification: Train, Validation, Test Split (Reference [3])

Holdout Method

The holdout method (Figure 4.5) is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.

Figure 4.5: Holdout Method

The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class. For unbalanced datasets, the samples might not be representative: few or no instances of some classes may end up in a split when one class is in the majority, as in fraudulent transaction detection and medical diagnostic tests. To make the sample representative for holdout we use stratification, in which we ensure that each class is represented in proportion to its share of the actual dataset.

Random sub-sampling is a variation of the holdout method in which the holdout procedure is repeated k times. In each iteration, a certain proportion is randomly selected for training (possibly with stratification). The error rates on the different iterations are averaged to yield an overall error rate. It is also known as the repeated holdout method.

K-fold Cross-Validation

In k-fold cross-validation (Figure 4.6), the initial data are randomly partitioned into k mutually exclusive subsets or folds, D_1, D_2, ..., D_k, each of approximately equal size. Training and testing are performed k times. In iteration i, partition D_i is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D_2, ..., D_k collectively serve as the training set to obtain the first model, which is tested on D_1; the second iteration is trained on subsets D_1, D_3, ..., D_k and tested on D_2; and so on. Unlike the holdout and random sub-sampling methods above, here each sample is used the same number of times for training and exactly once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.

Leave-One-Out CV

Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is left out at a time for the test set.
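Both the holdout split and the k-fold partitioning described above amount to index bookkeeping; any learner can then be trained on one part and tested on the other. A minimal sketch (function names are illustrative; setting k equal to the number of examples gives leave-one-out):

```python
import random

def holdout_split(n, train_fraction=2/3, seed=0):
    """Randomly partition indices 0..n-1 into a training and a test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def kfold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

train, test = holdout_split(30)
print(len(train), len(test))        # 20 10

# Each example is tested exactly once across the k folds.
tested = sorted(j for _, t in kfold_splits(30, k=5) for j in t)
print(tested == list(range(30)))    # True
```

Stratified variants would shuffle and slice the indices of each class separately and then merge the per-class folds, preserving class proportions as described above.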
Some features of leave-one-out CV:

1. It makes the best use of the data.
2. It involves no random sub-sampling.

Figure 4.6: k-fold Cross-Validation

Disadvantages of leave-one-out CV:

1. Stratification is not possible, since each test set contains only a single instance.
2. It is very computationally expensive.

Bootstrap

Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set. The bootstrap instead uses sampling with replacement to form the training set:

1. Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
2. Use this data as the training set.
3. Use the instances from the original dataset that do not occur in the new training set for testing.
4. A particular instance has a probability of 1 - 1/n of not being picked in a single draw. Thus its probability of ending up in the test data is (as n tends to infinity):

(1 - 1/n)^n ~ e^-1 = 0.368    (4.9)

5. This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.
6. The error estimate on the test data will be very pessimistic because the classifier is trained on just 63% of the instances.

7. Therefore, we combine it with the training error:

err = 0.632 x e_test + 0.368 x e_train    (4.10)

8. The training error gets less weight than the error on the test data.

Comparing Two Classifier Models

Suppose that we have generated two models, M1 and M2 (for either classification or prediction), from our data, and have performed 10-fold cross-validation to obtain a mean error rate for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases, and there can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant. What if any difference between the two can be attributed to chance? The following points explain in detail how to test whether their difference is statistically significant.

1. Assume that we have two classifiers, M_1 and M_2, and we would like to know which one is better for a classification problem.
2. We test the classifiers on n test data sets D_1, D_2, ..., D_n and we receive error rate estimates e_11, e_12, ..., e_1n for classifier M_1 and error rate estimates e_21, e_22, ..., e_2n for classifier M_2.
3. Using these rate estimates we can compute the mean error rate e_1 for classifier M_1 and the mean error rate e_2 for classifier M_2.
4. These mean error rates are just estimates of error on the true population of future data cases.
5. We note that the error rate estimates e_11, e_12, ..., e_1n for classifier M_1 and e_21, e_22, ..., e_2n for classifier M_2 are paired. Thus, we consider the differences d_1, d_2, ..., d_n where d_j = e_1j - e_2j.
6. The differences d_1, d_2, ..., d_n are instantiations of n random variables D_1, D_2, ..., D_n with mean mu_D and standard deviation sigma_D.
7. We need to establish confidence intervals for mu_D in order to decide whether the difference in the generalization performance of the classifiers M_1 and M_2 is statistically significant or not.

8. Since the standard deviation sigma_D is unknown, we approximate it using the sample standard deviation s_d:

s_d = sqrt( (1/n) * sum_{i=1}^{n} [ (e_1i - e_2i) - (e_1 - e_2) ]^2 )    (4.11)

9. The t-statistic is

T = (d - mu_D) / (s_d / sqrt(n))    (4.12)

10. The t-statistic is governed by a t-distribution with n - 1 degrees of freedom. Figure 4.7 shows a t-distribution curve (Reference [4]).

Figure 4.7: t-distribution curve (Reference [4])

11. If d and s_d are the mean and standard deviation of the normally distributed differences of n random pairs of errors, a (1 - alpha) * 100% confidence interval for mu_D = mu_1 - mu_2 is:

d - t_{alpha/2} * s_d / sqrt(n) < mu_D < d + t_{alpha/2} * s_d / sqrt(n)    (4.13)

where t_{alpha/2} is the t-value with v = n - 1 degrees of freedom, leaving an area of alpha/2 to the right.

12. If t > t_{alpha/2} or t < -t_{alpha/2}, then t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M_1 and M_2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we conclude that any difference between M_1 and M_2 can be attributed to chance.

4.4 ROC Curves

The ROC curve (Receiver Operating Characteristic) is commonly used to examine the tradeoff between the detection of true positives and the avoidance of false positives. As you might suspect from the name, ROC curves were developed by engineers in the field of communications around the time of World War II; receivers of radar and radio signals needed a method to discriminate between true signals and false alarms. The same technique is useful today for visualizing the efficacy of machine learning models. The characteristics of a typical ROC diagram are depicted in the following plot (Figure 4.8, Reference [2]). Curves are defined on a plot with the proportion of true positives on the vertical axis and the proportion of false positives on the horizontal axis. Because these values are equivalent to sensitivity and (1 - specificity), respectively, the diagram is also known as a sensitivity/specificity plot.

Figure 4.8: ROC curves (Reference [2])

The points comprising ROC curves indicate the true positive rate at varying false positive thresholds. To create the curves, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first. Beginning at the origin, each prediction's impact on the true positive rate and false positive rate traces the curve vertically (for a correct prediction) or horizontally (for an incorrect prediction).

To illustrate this concept, three hypothetical classifiers are contrasted in the preceding plot. First, the diagonal line from the bottom-left to the top-right corner of the diagram represents a classifier with no predictive value. This type of classifier detects true positives and false positives at exactly the same rate, implying that the classifier cannot discriminate between the two. This is the baseline by which other classifiers may be judged; ROC curves falling close to this line indicate models that are not very useful. Similarly, the perfect classifier has a curve that passes through the point at 100 percent true positive rate and 0 percent false positive rate; it is able to correctly identify all of the true positives before it incorrectly classifies any negative result. Most real-world classifiers are similar to the test classifier: they fall somewhere in the zone between perfect and useless.

The closer the curve is to the perfect classifier, the better it is at identifying positive values. This can be measured using a statistic known as the area under the ROC curve (abbreviated AUC). The AUC, as you might expect, treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve. AUC ranges from 0.5 (for a classifier with no predictive value) to 1.0 (for a perfect classifier).
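The curve-tracing procedure just described, and the AUC it implies, can be sketched by sorting predictions by estimated positive-class probability and stepping up for each positive and right for each negative (the functions and data are illustrative):

```python
def roc_points(labels, scores):
    """Trace the ROC curve: sort by descending score, step up (TPR) for a
    positive label and right (FPR) for a negative one."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]                 # 1 = positive class
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # estimated P(positive)
print(round(auc(roc_points(labels, scores)), 3))   # 0.889
```

The single misordered pair (the negative scored 0.7 above the positive scored 0.6) is what keeps the AUC at 8/9 rather than a perfect 1.0.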
A convention for interpreting AUC scores uses a system similar to academic letter grades:

0.9 to 1.0 = A (outstanding)
0.8 to 0.9 = B (excellent/good)
0.7 to 0.8 = C (acceptable/fair)
0.6 to 0.7 = D (poor)
0.5 to 0.6 = F (no discrimination)

As with most scales similar to this, the levels may work better for some tasks than others; the categorization is somewhat subjective.

4.5 Ensemble Methods

Motivation

1. An ensemble model improves accuracy and robustness over single-model methods.
2. Applications:
   (a) distributed computing
   (b) privacy-preserving applications
   (c) large-scale data with reusable models
   (d) multiple sources of data
3. Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (a divide-and-conquer approach).

Why Do Ensembles Work?

1. Intuition: combining diverse, independent opinions in human decision-making acts as a protective mechanism, e.g. a stock portfolio.
2. They overcome the limitations of a single hypothesis: the target function may not be implementable with individual classifiers, but may be approximated by model averaging.
3. They give a global picture.

Figure 4.9: Ensemble Gives a Global Picture

Ensembles Work in Two Ways

1. Learn to Combine

Figure 4.10: Learn to Combine (Reference [3])

2. Learn by Consensus

Figure 4.11: Learn by Consensus (Reference [3])

Learn to Combine

Pros:
1. Gets useful feedback from the labeled data.
2. Can potentially improve accuracy.

Cons:
1. Needs to keep the labeled data to train the ensemble.
2. May overfit the labeled data.
3. Cannot work when no labels are available.

Learn by Consensus

Pros:
1. Does not need labeled data.
2. Can improve the generalization performance.

Cons:
1. No feedback from the labeled data.
2. Requires the assumption that consensus is better.

Bagging

Given a set D of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set D_i of d tuples is sampled with replacement from the original set of tuples D. The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in D_i, whereas others may occur more than once. A classifier model M_i is learned for each training set D_i. To classify an unknown tuple X, each classifier M_i returns its class prediction, which counts as one vote. The bagged classifier M* counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.

Algorithm: The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.

Input:
  D, a set of d training tuples;
  k, the number of models in the ensemble;
  a learning scheme (e.g., decision tree algorithm, back-propagation, etc.)
Output: A composite model, M*.
Method:
  (1) for i = 1 to k do            // create k models
  (2)   create a bootstrap sample, D_i, by sampling D with replacement;
  (3)   use D_i to derive a model, M_i;
  (4) end for
To use the composite model on a tuple X:
  (1) if classification then
  (2)   let each of the k models classify X and return the majority vote;
  (3) if prediction then
  (4)   let each of the k models predict a value for X and return the average predicted value;

Boosting

Principles:

1. Boost a set of weak learners to a strong learner.
2. An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
3. Initially, all N records are assigned equal weights; unlike bagging, weights may change at the end of each boosting round.
4. Records that are wrongly classified will have their weights increased.
5. Records that are classified correctly will have their weights decreased.
6. Equal weights are assigned to each training tuple (1/d for round 1).
7. After classifier M_i is learned, the weights are adjusted to allow the subsequent classifier, M_{i+1}, to pay more attention to the tuples that were misclassified by M_i.

8. The final boosted classifier, M*, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.
9. AdaBoost is a popular boosting algorithm.

AdaBoost Boosting Algorithm

Input:
  1) a training set D containing d tuples;
  2) k rounds;
  3) a classification learning scheme.
Output: A composite model.

Method:

1. The data set D contains d class-labeled tuples (X_1, y_1), (X_2, y_2), (X_3, y_3), ..., (X_d, y_d).
2. Initially, assign equal weight 1/d to each tuple.
3. To generate k base classifiers, we need k rounds or iterations.
4. In round i, tuples from D are sampled with replacement to form D_i (of size d).
5. Each tuple's chance of being selected depends on its weight.
6. A base classifier, M_i, is derived from the training tuples of D_i.
7. The error of M_i is tested using D_i.
8. The weights of the training tuples are adjusted depending on how they were classified: correctly classified, decrease weight; incorrectly classified, increase weight.
9. The weight of a tuple indicates how hard it is to classify (directly proportional).
10. Some classifiers may be better at classifying some hard tuples than others.
11. We finally have a series of classifiers that complement each other.
12. Error estimate:

error(M_i) = sum_{j=1}^{d} w_j x err(X_j)    (4.14)

where err(X_j) is the misclassification error for tuple X_j (equal to 1 if misclassified).
13. If the classifier's error exceeds 0.5, we abandon it and try again with a new D_i and a new M_i derived from it. The error rate error(M_i) affects how the weights of the training tuples are updated:

14. If a tuple is correctly classified in round i, its weight is multiplied by

error(M_i) / (1 - error(M_i))    (4.15)

15. The weights of all correctly classified tuples are adjusted in this way.
16. The weights of all tuples (including the misclassified tuples) are then normalized:

normalization factor = (sum of old weights) / (sum of new weights)    (4.16)

17. The weight of classifier M_i's vote is

log( (1 - error(M_i)) / error(M_i) )

18. The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be.
19. To classify an unseen tuple X, for each class c we sum the weights of each classifier that assigned class c to X.
20. The class with the highest sum is the winner!
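The AdaBoost procedure above can be condensed into a short loop. The sketch below is illustrative rather than a transcription of the algorithm in the text: it uses one-dimensional threshold stumps as weak learners and reweighting in place of resampling (a common simplification), and the 1/2 factor in the vote weight is the conventional discrete-AdaBoost scaling of the log-odds formula:

```python
import math

def best_stump(xs, ys, w):
    """Pick the threshold/polarity minimizing weighted error (weak learner)."""
    best = None
    for t in sorted(set(xs)):
        for pol in (1, -1):
            pred = [pol if x >= t else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best  # (weighted error, threshold, polarity)

def adaboost(xs, ys, rounds=5):
    """ys in {-1, +1}. Returns a list of (vote_weight, threshold, polarity)."""
    n = len(xs)
    w = [1.0 / n] * n                            # equal initial weights, 1/d
    ensemble = []
    for _ in range(rounds):
        err, t, pol = best_stump(xs, ys, w)
        if err >= 0.5:                           # abandon a too-weak classifier
            break
        err = max(err, 1e-10)                    # avoid log(…/0) on perfect fits
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight from accuracy
        ensemble.append((alpha, t, pol))
        # Reweight: shrink correct tuples, grow misclassified ones; normalize.
        pred = [pol if x >= t else -pol for x in xs]
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, pred)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def classify(ensemble, x):
    """Weighted vote of all base classifiers; the larger sum wins."""
    vote = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if vote >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1, 1, 1]                 # separable at x >= 4
model = adaboost(xs, ys)
print([classify(model, x) for x in xs] == ys)    # True
```

On this toy separable data a single stump at x >= 4 already suffices, but with noisier labels the reweighting loop is what forces later rounds to concentrate on the hard tuples.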

Chapter 5

Conclusion and Future Scope

5.1 Comparative Study

To practically explore the theoretical aspects of the data mining models and the techniques to evaluate them, we conducted a small-scale exploratory study in the data mining tool Weka, developed by the University of Waikato, New Zealand. The following tables summarize the results of our exploratory study.

Figure 5.1: Weka Screenshots


More information

For the percentage of full time students at RCC the symbols would be:

For the percentage of full time students at RCC the symbols would be: Mth 17/171 Chpter 7- ypothesis Testing with One Smple This chpter is s simple s the previous one, except it is more interesting In this chpter we will test clims concerning the sme prmeters tht we worked

More information

Recitation 3: More Applications of the Derivative

Recitation 3: More Applications of the Derivative Mth 1c TA: Pdric Brtlett Recittion 3: More Applictions of the Derivtive Week 3 Cltech 2012 1 Rndom Question Question 1 A grph consists of the following: A set V of vertices. A set E of edges where ech

More information

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz University of Southern Cliforni Computer Science Deprtment Compiler Design Fll Lexicl Anlysis Smple Exercises nd Solutions Prof. Pedro C. Diniz USC / Informtion Sciences Institute 4676 Admirlty Wy, Suite

More information

5: The Definite Integral

5: The Definite Integral 5: The Definite Integrl 5.: Estimting with Finite Sums Consider moving oject its velocity (meters per second) t ny time (seconds) is given y v t = t+. Cn we use this informtion to determine the distnce

More information

Classification: Rules. Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo regionale di Como

Classification: Rules. Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo regionale di Como Metodologie per Sistemi Intelligenti Clssifiction: Prof. Pier Luc Lnzi Lure in Ingegneri Informtic Politecnico di Milno Polo regionle di Como Rules Lecture outline Why rules? Wht re clssifiction rules?

More information

Tests for the Ratio of Two Poisson Rates

Tests for the Ratio of Two Poisson Rates Chpter 437 Tests for the Rtio of Two Poisson Rtes Introduction The Poisson probbility lw gives the probbility distribution of the number of events occurring in specified intervl of time or spce. The Poisson

More information

CHAPTER 1 PROGRAM OF MATRICES

CHAPTER 1 PROGRAM OF MATRICES CHPTER PROGRM OF MTRICES -- INTRODUCTION definition of engineering is the science y which the properties of mtter nd sources of energy in nture re mde useful to mn. Thus n engineer will hve to study the

More information

Non-Linear & Logistic Regression

Non-Linear & Logistic Regression Non-Liner & Logistic Regression If the sttistics re boring, then you've got the wrong numbers. Edwrd R. Tufte (Sttistics Professor, Yle University) Regression Anlyses When do we use these? PART 1: find

More information

Bases for Vector Spaces

Bases for Vector Spaces Bses for Vector Spces 2-26-25 A set is independent if, roughly speking, there is no redundncy in the set: You cn t uild ny vector in the set s liner comintion of the others A set spns if you cn uild everything

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique?

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique? XII. LINEAR ALGEBRA: SOLVING SYSTEMS OF EQUATIONS Tody we re going to tlk bout solving systems of liner equtions. These re problems tht give couple of equtions with couple of unknowns, like: 6 2 3 7 4

More information

The Minimum Label Spanning Tree Problem: Illustrating the Utility of Genetic Algorithms

The Minimum Label Spanning Tree Problem: Illustrating the Utility of Genetic Algorithms The Minimum Lel Spnning Tree Prolem: Illustrting the Utility of Genetic Algorithms Yupei Xiong, Univ. of Mrylnd Bruce Golden, Univ. of Mrylnd Edwrd Wsil, Americn Univ. Presented t BAE Systems Distinguished

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

Review of Calculus, cont d

Review of Calculus, cont d Jim Lmbers MAT 460 Fll Semester 2009-10 Lecture 3 Notes These notes correspond to Section 1.1 in the text. Review of Clculus, cont d Riemnn Sums nd the Definite Integrl There re mny cses in which some

More information

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations.

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations. Lecture 3 3 Solving liner equtions In this lecture we will discuss lgorithms for solving systems of liner equtions Multiplictive identity Let us restrict ourselves to considering squre mtrices since one

More information

Section 4: Integration ECO4112F 2011

Section 4: Integration ECO4112F 2011 Reding: Ching Chpter Section : Integrtion ECOF Note: These notes do not fully cover the mteril in Ching, ut re ment to supplement your reding in Ching. Thus fr the optimistion you hve covered hs een sttic

More information

7.1 Integral as Net Change and 7.2 Areas in the Plane Calculus

7.1 Integral as Net Change and 7.2 Areas in the Plane Calculus 7.1 Integrl s Net Chnge nd 7. Ares in the Plne Clculus 7.1 INTEGRAL AS NET CHANGE Notecrds from 7.1: Displcement vs Totl Distnce, Integrl s Net Chnge We hve lredy seen how the position of n oject cn e

More information

Chapter 4: Techniques of Circuit Analysis. Chapter 4: Techniques of Circuit Analysis

Chapter 4: Techniques of Circuit Analysis. Chapter 4: Techniques of Circuit Analysis Chpter 4: Techniques of Circuit Anlysis Terminology Node-Voltge Method Introduction Dependent Sources Specil Cses Mesh-Current Method Introduction Dependent Sources Specil Cses Comprison of Methods Source

More information

Math 8 Winter 2015 Applications of Integration

Math 8 Winter 2015 Applications of Integration Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl

More information

A study of Pythagoras Theorem

A study of Pythagoras Theorem CHAPTER 19 A study of Pythgors Theorem Reson is immortl, ll else mortl. Pythgors, Diogenes Lertius (Lives of Eminent Philosophers) Pythgors Theorem is proly the est-known mthemticl theorem. Even most nonmthemticins

More information

Things to Memorize: A Partial List. January 27, 2017

Things to Memorize: A Partial List. January 27, 2017 Things to Memorize: A Prtil List Jnury 27, 2017 Chpter 2 Vectors - Bsic Fcts A vector hs mgnitude (lso clled size/length/norm) nd direction. It does not hve fixed position, so the sme vector cn e moved

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

The practical version

The practical version Roerto s Notes on Integrl Clculus Chpter 4: Definite integrls nd the FTC Section 7 The Fundmentl Theorem of Clculus: The prcticl version Wht you need to know lredy: The theoreticl version of the FTC. Wht

More information

Review of Gaussian Quadrature method

Review of Gaussian Quadrature method Review of Gussin Qudrture method Nsser M. Asi Spring 006 compiled on Sundy Decemer 1, 017 t 09:1 PM 1 The prolem To find numericl vlue for the integrl of rel vlued function of rel vrile over specific rnge

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

The steps of the hypothesis test

The steps of the hypothesis test ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of

More information

1 Online Learning and Regret Minimization

1 Online Learning and Regret Minimization 2.997 Decision-Mking in Lrge-Scle Systems My 10 MIT, Spring 2004 Hndout #29 Lecture Note 24 1 Online Lerning nd Regret Minimiztion In this lecture, we consider the problem of sequentil decision mking in

More information

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true.

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true. York University CSE 2 Unit 3. DFA Clsses Converting etween DFA, NFA, Regulr Expressions, nd Extended Regulr Expressions Instructor: Jeff Edmonds Don t chet y looking t these nswers premturely.. For ech

More information

Designing finite automata II

Designing finite automata II Designing finite utomt II Prolem: Design DFA A such tht L(A) consists of ll strings of nd which re of length 3n, for n = 0, 1, 2, (1) Determine wht to rememer out the input string Assign stte to ech of

More information

2.4 Linear Inequalities and Interval Notation

2.4 Linear Inequalities and Interval Notation .4 Liner Inequlities nd Intervl Nottion We wnt to solve equtions tht hve n inequlity symol insted of n equl sign. There re four inequlity symols tht we will look t: Less thn , Less thn or

More information

Designing Information Devices and Systems I Spring 2018 Homework 8

Designing Information Devices and Systems I Spring 2018 Homework 8 EECS 16A Designing Informtion Devices nd Systems I Spring 2018 Homework 8 This homework is due Mrch 19, 2018, t 23:59. Self-grdes re due Mrch 22, 2018, t 23:59. Sumission Formt Your homework sumission

More information

Section 6.1 Definite Integral

Section 6.1 Definite Integral Section 6.1 Definite Integrl Suppose we wnt to find the re of region tht is not so nicely shped. For exmple, consider the function shown elow. The re elow the curve nd ove the x xis cnnot e determined

More information

Improper Integrals. The First Fundamental Theorem of Calculus, as we ve discussed in class, goes as follows:

Improper Integrals. The First Fundamental Theorem of Calculus, as we ve discussed in class, goes as follows: Improper Integrls The First Fundmentl Theorem of Clculus, s we ve discussed in clss, goes s follows: If f is continuous on the intervl [, ] nd F is function for which F t = ft, then ftdt = F F. An integrl

More information

Fast Frequent Free Tree Mining in Graph Databases

Fast Frequent Free Tree Mining in Graph Databases The Chinese University of Hong Kong Fst Frequent Free Tree Mining in Grph Dtses Peixing Zho Jeffrey Xu Yu The Chinese University of Hong Kong Decemer 18 th, 2006 ICDM Workshop MCD06 Synopsis Introduction

More information

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by. NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with

More information

Suppose we want to find the area under the parabola and above the x axis, between the lines x = 2 and x = -2.

Suppose we want to find the area under the parabola and above the x axis, between the lines x = 2 and x = -2. Mth 43 Section 6. Section 6.: Definite Integrl Suppose we wnt to find the re of region tht is not so nicely shped. For exmple, consider the function shown elow. The re elow the curve nd ove the x xis cnnot

More information

CM10196 Topic 4: Functions and Relations

CM10196 Topic 4: Functions and Relations CM096 Topic 4: Functions nd Reltions Guy McCusker W. Functions nd reltions Perhps the most widely used notion in ll of mthemtics is tht of function. Informlly, function is n opertion which tkes n input

More information

Week 10: Line Integrals

Week 10: Line Integrals Week 10: Line Integrls Introduction In this finl week we return to prmetrised curves nd consider integrtion long such curves. We lredy sw this in Week 2 when we integrted long curve to find its length.

More information

Name Ima Sample ASU ID

Name Ima Sample ASU ID Nme Im Smple ASU ID 2468024680 CSE 355 Test 1, Fll 2016 30 Septemer 2016, 8:35-9:25.m., LSA 191 Regrding of Midterms If you elieve tht your grde hs not een dded up correctly, return the entire pper to

More information

Minimal DFA. minimal DFA for L starting from any other

Minimal DFA. minimal DFA for L starting from any other Miniml DFA Among the mny DFAs ccepting the sme regulr lnguge L, there is exctly one (up to renming of sttes) which hs the smllest possile numer of sttes. Moreover, it is possile to otin tht miniml DFA

More information

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true.

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true. York University CSE 2 Unit 3. DFA Clsses Converting etween DFA, NFA, Regulr Expressions, nd Extended Regulr Expressions Instructor: Jeff Edmonds Don t chet y looking t these nswers premturely.. For ech

More information

Continuous Random Variable X:

Continuous Random Variable X: Continuous Rndom Vrile : The continuous rndom vrile hs its vlues in n intervl, nd it hs proility distriution unction or proility density unction p.d. stisies:, 0 & d Which does men tht the totl re under

More information

Monte Carlo method in solving numerical integration and differential equation

Monte Carlo method in solving numerical integration and differential equation Monte Crlo method in solving numericl integrtion nd differentil eqution Ye Jin Chemistry Deprtment Duke University yj66@duke.edu Abstrct: Monte Crlo method is commonly used in rel physics problem. The

More information

QUADRATURE is an old-fashioned word that refers to

QUADRATURE is an old-fashioned word that refers to World Acdemy of Science Engineering nd Technology Interntionl Journl of Mthemticl nd Computtionl Sciences Vol:5 No:7 011 A New Qudrture Rule Derived from Spline Interpoltion with Error Anlysis Hdi Tghvfrd

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2016

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2016 CS125 Lecture 12 Fll 2016 12.1 Nondeterminism The ide of nondeterministic computtions is to llow our lgorithms to mke guesses, nd only require tht they ccept when the guesses re correct. For exmple, simple

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

19 Optimal behavior: Game theory

19 Optimal behavior: Game theory Intro. to Artificil Intelligence: Dle Schuurmns, Relu Ptrscu 1 19 Optiml behvior: Gme theory Adversril stte dynmics hve to ccount for worst cse Compute policy π : S A tht mximizes minimum rewrd Let S (,

More information

The University of Nottingham SCHOOL OF COMPUTER SCIENCE A LEVEL 2 MODULE, SPRING SEMESTER LANGUAGES AND COMPUTATION ANSWERS

The University of Nottingham SCHOOL OF COMPUTER SCIENCE A LEVEL 2 MODULE, SPRING SEMESTER LANGUAGES AND COMPUTATION ANSWERS The University of Nottinghm SCHOOL OF COMPUTER SCIENCE LEVEL 2 MODULE, SPRING SEMESTER 2016 2017 LNGUGES ND COMPUTTION NSWERS Time llowed TWO hours Cndidtes my complete the front cover of their nswer ook

More information

1 Nondeterministic Finite Automata

1 Nondeterministic Finite Automata 1 Nondeterministic Finite Automt Suppose in life, whenever you hd choice, you could try oth possiilities nd live your life. At the end, you would go ck nd choose the one tht worked out the est. Then you

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

Nondeterminism and Nodeterministic Automata

Nondeterminism and Nodeterministic Automata Nondeterminism nd Nodeterministic Automt 61 Nondeterminism nd Nondeterministic Automt The computtionl mchine models tht we lerned in the clss re deterministic in the sense tht the next move is uniquely

More information

10. AREAS BETWEEN CURVES

10. AREAS BETWEEN CURVES . AREAS BETWEEN CURVES.. Ares etween curves So res ove the x-xis re positive nd res elow re negtive, right? Wrong! We lied! Well, when you first lern out integrtion it s convenient fiction tht s true in

More information

Operations with Polynomials

Operations with Polynomials 38 Chpter P Prerequisites P.4 Opertions with Polynomils Wht you should lern: How to identify the leding coefficients nd degrees of polynomils How to dd nd subtrct polynomils How to multiply polynomils

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

The area under the graph of f and above the x-axis between a and b is denoted by. f(x) dx. π O

The area under the graph of f and above the x-axis between a and b is denoted by. f(x) dx. π O 1 Section 5. The Definite Integrl Suppose tht function f is continuous nd positive over n intervl [, ]. y = f(x) x The re under the grph of f nd ove the x-xis etween nd is denoted y f(x) dx nd clled the

More information

MA123, Chapter 10: Formulas for integrals: integrals, antiderivatives, and the Fundamental Theorem of Calculus (pp.

MA123, Chapter 10: Formulas for integrals: integrals, antiderivatives, and the Fundamental Theorem of Calculus (pp. MA123, Chpter 1: Formuls for integrls: integrls, ntiderivtives, nd the Fundmentl Theorem of Clculus (pp. 27-233, Gootmn) Chpter Gols: Assignments: Understnd the sttement of the Fundmentl Theorem of Clculus.

More information

Chapters Five Notes SN AA U1C5

Chapters Five Notes SN AA U1C5 Chpters Five Notes SN AA U1C5 Nme Period Section 5-: Fctoring Qudrtic Epressions When you took lger, you lerned tht the first thing involved in fctoring is to mke sure to fctor out ny numers or vriles

More information

Rudimentary Matrix Algebra

Rudimentary Matrix Algebra Rudimentry Mtrix Alger Mrk Sullivn Decemer 4, 217 i Contents 1 Preliminries 1 1.1 Why does this document exist?.................... 1 1.2 Why does nyone cre out mtrices?................ 1 1.3 Wht is mtrix?...........................

More information

Model Reduction of Finite State Machines by Contraction

Model Reduction of Finite State Machines by Contraction Model Reduction of Finite Stte Mchines y Contrction Alessndro Giu Dip. di Ingegneri Elettric ed Elettronic, Università di Cgliri, Pizz d Armi, 09123 Cgliri, Itly Phone: +39-070-675-5892 Fx: +39-070-675-5900

More information

0.1 THE REAL NUMBER LINE AND ORDER

0.1 THE REAL NUMBER LINE AND ORDER 6000_000.qd //0 :6 AM Pge 0-0- CHAPTER 0 A Preclculus Review 0. THE REAL NUMBER LINE AND ORDER Represent, clssify, nd order rel numers. Use inequlities to represent sets of rel numers. Solve inequlities.

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

Lecture 08: Feb. 08, 2019

Lecture 08: Feb. 08, 2019 4CS4-6:Theory of Computtion(Closure on Reg. Lngs., regex to NDFA, DFA to regex) Prof. K.R. Chowdhry Lecture 08: Fe. 08, 2019 : Professor of CS Disclimer: These notes hve not een sujected to the usul scrutiny

More information

13: Diffusion in 2 Energy Groups

13: Diffusion in 2 Energy Groups 3: Diffusion in Energy Groups B. Rouben McMster University Course EP 4D3/6D3 Nucler Rector Anlysis (Rector Physics) 5 Sept.-Dec. 5 September Contents We study the diffusion eqution in two energy groups

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

Homework Solution - Set 5 Due: Friday 10/03/08

Homework Solution - Set 5 Due: Friday 10/03/08 CE 96 Introduction to the Theory of Computtion ll 2008 Homework olution - et 5 Due: ridy 10/0/08 1. Textook, Pge 86, Exercise 1.21. () 1 2 Add new strt stte nd finl stte. Mke originl finl stte non-finl.

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

DFA minimisation using the Myhill-Nerode theorem

DFA minimisation using the Myhill-Nerode theorem DFA minimistion using the Myhill-Nerode theorem Johnn Högerg Lrs Lrsson Astrct The Myhill-Nerode theorem is n importnt chrcteristion of regulr lnguges, nd it lso hs mny prcticl implictions. In this chpter,

More information

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.)

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.) CS 373, Spring 29. Solutions to Mock midterm (sed on first midterm in CS 273, Fll 28.) Prolem : Short nswer (8 points) The nswers to these prolems should e short nd not complicted. () If n NF M ccepts

More information

Review of Probability Distributions. CS1538: Introduction to Simulations

Review of Probability Distributions. CS1538: Introduction to Simulations Review of Proility Distriutions CS1538: Introduction to Simultions Some Well-Known Proility Distriutions Bernoulli Binomil Geometric Negtive Binomil Poisson Uniform Exponentil Gmm Erlng Gussin/Norml Relevnce

More information

How can we approximate the area of a region in the plane? What is an interpretation of the area under the graph of a velocity function?

How can we approximate the area of a region in the plane? What is an interpretation of the area under the graph of a velocity function? Mth 125 Summry Here re some thoughts I ws hving while considering wht to put on the first midterm. The core of your studying should be the ssigned homework problems: mke sure you relly understnd those

More information

Formal languages, automata, and theory of computation

Formal languages, automata, and theory of computation Mälrdlen University TEN1 DVA337 2015 School of Innovtion, Design nd Engineering Forml lnguges, utomt, nd theory of computtion Thursdy, Novemer 5, 14:10-18:30 Techer: Dniel Hedin, phone 021-107052 The exm

More information