Data Mining: Concepts and Techniques
Classification and Prediction (Chapter 6.4-6)
CSE-4412: Data Mining, February 8, 2007


Chapter 6: Classification and Prediction

1. What is classification? What is prediction?
2. Issues regarding classification and prediction
3. Classification by decision tree induction
4. Bayesian classification
5. Rule-based classification
6. Classification by backpropagation
7. Support Vector Machines (SVM)
8. Summary

Basic Idea (Again)

Use old tuples with known classes to classify new tuples with unknown classes.
E.g., the tuples are customers:
- old tuples: previous and current customers
- new tuples: prospective customers
- question: Is the customer a good credit risk?
- classes (answers): good, fair, poor
Why not just use the class prior probabilities over the old tuples? Yes, why not?

Use the Attributes

Okay, the tuples have attributes. Use the attribute values to do better classification.
Idea: Given a new tuple (e.g., <25, $72k, student>), use just the old tuples that match exactly to decide.
Would this work? What are the problems with this approach?
Still a good idea. How can we fix this approach?

Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities. Based on Bayes' theorem.
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Standard: can be computationally intractable, but provides a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem: Basics

Let X be a data sample ("evidence") whose class label is unknown.
Let H be the hypothesis that X belongs to class C.
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X.
- P(H) (prior probability): the initial probability. E.g., X will buy a computer, regardless of age, income, ...
- P(X): the probability that the sample data is observed.
- P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X is age 31..40 with medium income.

Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally: posterior = likelihood × prior / evidence.
Predict that X belongs to class Ci iff the probability P(Ci|X) is the highest among the P(Ck|X) for all k classes.
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.

Towards a Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are k classes C1, C2, ..., Ck.
Classification derives the maximum a posteriori class, i.e., the maximal P(Ci|X). From Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
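As a quick sanity check of the decision rule, here is a minimal Python sketch of MAP classification. The numbers reuse the P(Ci) and P(X|Ci) values from the worked buys_computer example a few slides below; the helper name posterior_scores is illustrative, not from the text.

```python
# Minimal sketch of the MAP decision rule: pick the class Ci maximizing
# P(X|Ci) * P(Ci). P(X) is dropped because it is the same for every class.

def posterior_scores(priors, likelihoods):
    """Return the unnormalized posterior P(X|Ci) * P(Ci) per class."""
    return {c: likelihoods[c] * priors[c] for c in priors}

priors = {"yes": 9 / 14, "no": 5 / 14}        # P(Ci), from the example below
likelihoods = {"yes": 0.044, "no": 0.019}     # P(X|Ci), from the example below

scores = posterior_scores(priors, likelihoods)
print(max(scores, key=scores.get))  # 'yes' (0.028 vs. 0.007)
```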

Derivation of the Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class (i.e., there are no dependence relations between attributes):

    P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

This greatly reduces the computational cost: only the class distribution needs to be counted.
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
- If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))    and    P(xk|Ci) = g(xk, μCi, σCi)

Naïve Bayesian Classifier: Training Dataset

Classes: C1: buys_computer = yes; C2: buys_computer = no.
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair).

    age     income  student  credit_rating  buys_computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31..40  high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31..40  low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31..40  medium  no       excellent      yes
    31..40  high    yes      fair           yes
    >40     medium  no       excellent      no
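The two per-attribute likelihood forms above translate directly into code. A minimal sketch, with illustrative function names (gaussian, categorical_likelihood) that are not from the text:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): density used for a continuous attribute Ak."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def categorical_likelihood(value, attr, class_tuples):
    """P(xk | Ci) for a categorical attribute: the fraction of Ci's
    tuples (given as dicts) whose attribute `attr` equals `value`."""
    return sum(1 for t in class_tuples if t[attr] == value) / len(class_tuples)

# E.g., over the table above, categorical_likelihood("yes", "student",
# yes_tuples) would give 6/9, as computed on the next slide.
```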

Naïve Bayesian Classifier: An Example

P(Ci):
    P(buys_computer = yes) = 9/14 = 0.643
    P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class:
    P(age = <=30 | buys_computer = yes) = 2/9 = 0.222
    P(age = <=30 | buys_computer = no) = 3/5 = 0.6
    P(income = medium | buys_computer = yes) = 4/9 = 0.444
    P(income = medium | buys_computer = no) = 2/5 = 0.4
    P(student = yes | buys_computer = yes) = 6/9 = 0.667
    P(student = yes | buys_computer = no) = 1/5 = 0.2
    P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
    P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
    P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
    P(X | buys_computer = yes) × P(buys_computer = yes) = 0.028
    P(X | buys_computer = no) × P(buys_computer = no) = 0.007
Therefore, X belongs to the class buys_computer = yes.

The Zero-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability P(X|Ci) = ∏(k=1..n) P(xk|Ci) will be zero.
E.g., suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10).
Use the Laplacian correction (or Laplacian estimator): add 1 to each count. E.g.:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
The corrected probability estimates are close to their uncorrected counterparts, but no estimate is ever zero.
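The Laplacian correction is one line of arithmetic per value. A sketch reproducing the income example above (the helper name is illustrative):

```python
from collections import Counter

def laplace_probs(values, domain):
    """Laplacian correction: add 1 to every value's count, so no
    estimated probability is ever zero."""
    counts = Counter(values)
    total = len(values) + len(domain)   # 1000 tuples + 3 added pseudo-counts
    return {v: (counts[v] + 1) / total for v in domain}

income = ["medium"] * 990 + ["high"] * 10   # income = low never occurs
probs = laplace_probs(income, domain=["low", "medium", "high"])
# {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```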

Naïve Bayesian Classifier: Evaluation (Advantages)

- Easy to implement and maintain.
- Easy to update incrementally with new training tuples.
- Good results obtained in many domains.
- No issues with overfitting the model.
- Reasonably immune to noise.
- Can work with missing values in the data, both in training and when classifying.
- Noise in the data (incorrect values) gets balanced out, to some extent.

Naïve Bayesian Classifier: Evaluation (Disadvantages)

- Assumes class-conditional independence, and therefore loses accuracy. In practice, dependencies do exist among variables. E.g., hospital patient profiles: age, family history, etc.; symptoms: fever, cough, etc.; diseases: lung cancer, diabetes, etc. Dependencies among these cannot be modeled by a naïve Bayesian classifier.
- Black box: the model cannot be interpreted.
How to deal with these dependencies? Bayesian belief networks.

Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent.
A graphical model of causal relationships:
- Represents dependencies among the variables.
- Gives a specification of the joint probability distribution.
- Nodes: random variables.
- Links: dependencies. E.g., in a network over X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.
- The graph has no loops or cycles.

Bayesian Belief Network: An Example

[Figure: a network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, in which FamilyHistory and Smoker are the parents of LungCancer.]
The conditional probability table (CPT) for the variable LungCancer (LC), given its parents FamilyHistory (FH) and Smoker (S):

          (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC     0.8      0.5       0.7       0.1
    ~LC    0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of a node's parents.
Derivation of the probability of a particular combination of values x1, ..., xn of X from the CPTs:

    P(x1, ..., xn) = ∏(i=1..n) P(xi | Parents(Xi))
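The product formula makes each joint-probability entry a few multiplications. A sketch for three nodes of the lung-cancer network, assuming hypothetical priors for FamilyHistory and Smoker (the slide only gives LungCancer's CPT):

```python
# P(LC | FH, S): the CPT from the slide, keyed by (FH, S) truth values.
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors for the root nodes -- NOT given on the slide.
p_fh, p_s = 0.3, 0.4

def joint(fh, s, lc):
    """P(fh, s, lc) = P(fh) * P(s) * P(lc | fh, s): the network
    factorization P(x1, ..., xn) = prod_i P(xi | Parents(Xi)),
    restricted to these three nodes."""
    pf = p_fh if fh else 1.0 - p_fh
    ps = p_s if s else 1.0 - p_s
    pl = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
    return pf * ps * pl

print(joint(True, True, True))  # 0.3 * 0.4 * 0.8 = 0.096
```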

Training Bayesian Networks

Several scenarios:
- Network structure known and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: use a gradient descent (greedy hill-climbing) method, analogous to neural network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
- Structure unknown, hidden variables: no good algorithms are known for this purpose!
Ref.: D. Heckerman, "Bayesian Networks for Data Mining".

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules:
    R: IF age = youth AND student = yes THEN buys_computer = yes
(rule antecedent/precondition vs. rule consequent).
Assessment of a rule: coverage and accuracy (a code sketch of these two measures follows below).
    n_covers = number of tuples covered by R
    n_correct = number of tuples correctly classified by R
    coverage(R) = n_covers / |D|     (D: training data set)
    accuracy(R) = n_correct / n_covers
If more than one rule is triggered, conflict resolution is needed:
- Size ordering: assign the highest priority to the triggering rule that has the toughest requirements (i.e., the most attribute tests).
- Class-based ordering: decreasing order of prevalence or misclassification cost per class.
- Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts.

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees.
One rule is created for each path from the root to a leaf: each attribute-value pair along a path forms a conjunction, and the leaf holds the class prediction.
The rules are mutually exclusive and exhaustive.
Example: rule extraction from our buys_computer decision tree (root age? with branches <=30, 31..40, >40; the <=30 branch tests student?, the >40 branch tests credit_rating?):
    IF age = young AND student = no THEN buys_computer = no
    IF age = young AND student = yes THEN buys_computer = yes
    IF age = mid-age THEN buys_computer = yes
    IF age = old AND credit_rating = excellent THEN buys_computer = yes
    IF age = old AND credit_rating = fair THEN buys_computer = no
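A minimal sketch of the two rule-quality measures, with a rule represented as an (antecedent predicate, consequent class) pair; this representation and the toy data are illustrative, not from the text:

```python
def coverage_and_accuracy(rule, data):
    """coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers.
    `rule` is (antecedent, consequent); `data` is a list of
    (tuple, class) pairs, with tuples as attribute dicts."""
    antecedent, consequent = rule
    covered = [(t, c) for t, c in data if antecedent(t)]
    n_covers = len(covered)
    n_correct = sum(1 for _, c in covered if c == consequent)
    return n_covers / len(data), (n_correct / n_covers if n_covers else 0.0)

# R: IF age = youth AND student = yes THEN buys_computer = yes
rule = (lambda t: t["age"] == "youth" and t["student"] == "yes", "yes")
data = [({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "youth", "student": "no"}, "no"),
        ({"age": "senior", "student": "yes"}, "yes")]
print(coverage_and_accuracy(rule, data))  # (0.333..., 1.0)
```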

Rule Extraction from the Training Data

Sequential covering algorithms extract rules directly from the training data. Typical examples: FOIL, AQ, CN2, RIPPER.
Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of the other classes.
Steps:
- Rules are learned one at a time.
- Each time a rule is learned, the tuples covered by the rule are removed.
- The process repeats on the remaining tuples until a termination condition holds, e.g., there are no more training examples, or the quality of the rule returned falls below a user-specified threshold.
Comparison with decision-tree induction: a decision tree learns a whole set of rules simultaneously.

Learn-One-Rule

Start with the most general rule possible: condition = empty.
Add new attribute tests by adopting a greedy depth-first strategy: pick the test that most improves the rule quality.
Rule-quality measures consider both coverage and accuracy. FOIL gain (in FOIL and RIPPER) assesses the information gained by extending the condition:

    FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg)))

It favors rules that have high accuracy and cover many positive tuples.
Rule pruning is based on an independent set of test tuples:

    FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
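Both FOIL measures are direct to compute. A sketch, with illustrative argument names: pos/neg are the counts before extending the rule, pos2/neg2 the primed counts after adding the candidate test:

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg))),
    where the primed counts are taken after extending the rule."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if the pruned
    version of R scores higher."""
    return (pos - neg) / (pos + neg)

# A candidate test that keeps 6 of 8 positives but only 2 of 12 negatives:
print(foil_gain(pos=8, neg=12, pos2=6, neg2=2))  # ~5.44: a good extension
```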

Classification as a Mathematical Mapping

Classification predicts categorical class labels. E.g., personal homepage classification:
    xi = (x1, x2, x3, ...), yi = +1 or −1
    x1: number of occurrences of the word "homepage"
    x2: number of occurrences of the word "welcome"
Mathematically: x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}. We want a function f: X → Y.
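To make the mapping concrete, a tiny sketch of the homepage example: extract the two word counts as x, then apply a linear f. The weights and threshold are hand-picked illustrative values, not from the text:

```python
def features(text):
    """x = (x1, x2): counts of the words 'homepage' and 'welcome'."""
    words = text.lower().split()
    return (words.count("homepage"), words.count("welcome"))

def f(x, w=(1.0, 1.0), b=-0.5):
    """A hypothetical linear classifier f: X -> Y = {+1, -1}."""
    return +1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print(f(features("welcome to my homepage")))   # +1
print(f(features("quarterly sales report")))   # -1
```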

Linear Classification

[Figure: points of class 'x' above a red line and points of class 'o' below it in the plane.]
A binary classification problem: the data above the red line belong to class 'x'; the data below the red line belong to class 'o'.
Examples: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers

Advantages:
- Prediction accuracy is generally high, as compared to Bayesian methods in general.
- Robust: works when training examples contain errors.
- Fast evaluation of the learned target function (Bayesian networks are normally slow).
Disadvantages:
- Long training time.
- Difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery.
- Not easy to incorporate domain knowledge, which is easy in Bayesian methods in the form of priors on the data or distributions.

Perceptron and Winnow

Notation: vectors x, w; scalars x, y, w.
Input: {(x1, y1), ...}
Output: a classification function f(x) such that f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = −1.
The decision boundary is f(x): w·x + b = 0, i.e., w1·x1 + w2·x2 + b = 0.
[Figure: a separating line in the (x1, x2) plane.]
Perceptron: update w additively.
Winnow: update w multiplicatively.

Classification by Backpropagation

A (nonlinear) neural network: a set of connected input/output units where each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class labels of the input tuples.
Also referred to as connectionist learning, due to the connections between units.
Backpropagation: a neural network learning algorithm.
The field was started by psychologists and neurobiologists to develop and test computational analogues of neurons.
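A minimal sketch of the additive perceptron update on toy data; Winnow would differ only in multiplying the weights on a mistake instead of adding. The training loop, learning rate, and data are illustrative:

```python
def train_perceptron(data, n_features, lr=1.0, epochs=20):
    """data: list of (x, y) with x a feature list and y in {+1, -1}.
    On each misclassified example, update additively:
    w <- w + lr * y * x  and  b <- b + lr * y."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in data:
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:  # wrong side of (or on) the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data: class +1 roughly above the line x1 + x2 = 4.
data = [([1.0, 1.0], -1), ([2.0, 0.0], -1), ([3.0, 3.0], +1), ([2.0, 4.0], +1)]
w, b = train_perceptron(data, n_features=2)
print(w, b)  # coefficients of a separating line w1*x1 + w2*x2 + b = 0
```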

Neural Network as a Classifier

Advantages:
- High tolerance to noisy data.
- Ability to classify untrained patterns.
- Well suited for continuous-valued inputs and outputs.
- Successful on a wide array of real-world data.
- Algorithms are inherently parallel.
Disadvantages:
- Long training time.
- Requires a number of parameters that are typically best determined empirically, e.g., the network topology or structure.
- Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the hidden units in the network.

A Neuron (= a Perceptron)

[Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum with bias −θk, which passes through an activation function f to produce the output y.]
For example:

    y = sign(∑(i=0..n) wi·xi + μk)

The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping (μk is the bias term, drawn as −θk in the figure).
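The single neuron above is a few lines of code. A sketch, with hand-picked weights and bias wired to behave as an AND gate (an illustrative choice, not from the text):

```python
def neuron(x, w, mu):
    """y = sign(sum_{i} w_i * x_i + mu_k): one threshold unit.
    mu plays the role of the bias (drawn as -theta_k in the figure)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + mu
    return +1 if s >= 0 else -1

# A 2-input AND gate over {0, 1} inputs:
print(neuron([1, 1], w=[1.0, 1.0], mu=-1.5))  # +1
print(neuron([1, 0], w=[1.0, 1.0], mu=-1.5))  # -1
```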

A Multi-Layer Feed-Forward Neural Network

[Figure: an input vector X feeds the input layer, one hidden layer, and an output layer producing the output vector.]
The update equations, for learning rate l, unit output Oj, and target value Tj:

    Ij = ∑(i) wij·Oi + θj                     (net input to unit j)
    Oj = 1 / (1 + e^(−Ij))                    (sigmoid output of unit j)
    Errj = Oj·(1 − Oj)·(Tj − Oj)              (error at an output unit)
    Errj = Oj·(1 − Oj)·∑(k) Errk·wjk          (error at a hidden unit)
    wij = wij + l·Errj·Oi                     (weight update)
    θj = θj + l·Errj                          (bias update)

How Does a Multi-Layer Neural Network Work?

The inputs to the network correspond to the attributes measured for each training tuple.
Inputs are fed simultaneously into the units making up the input layer. They are then weighted and fed simultaneously to a hidden layer. The number of hidden layers is arbitrary, although usually only one is used.
The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

Defining a Network Topology

First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0].
Use one input unit per domain value, each initialized to 0.
For the output: if used for classification with more than two classes, use one output unit per class.
Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation

Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value.
Modifications are made in the backwards direction: from the output layer, through each hidden layer, down to the first hidden layer; hence the name backpropagation.
Steps:
1. Initialize the weights (to small random numbers) and the biases in the network.
2. Propagate the inputs forward (by applying the activation function).
3. Backpropagate the error (by updating the weights and biases).
4. Check the terminating condition (e.g., when the error is very small).
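Putting the four steps and the update equations from the earlier slide together, here is a compact single-hidden-layer sketch in plain Python. The architecture, learning rate, and XOR data are illustrative choices, not from the text:

```python
import math, random

def sigmoid(I):
    return 1.0 / (1.0 + math.exp(-I))

def train_backprop(data, n_in, n_hidden, lr=0.5, epochs=5000):
    """One hidden layer, sigmoid units, updates exactly as on the slides:
    Err_j = O_j(1-O_j)(T_j-O_j) at the output unit,
    Err_j = O_j(1-O_j) * sum_k Err_k * w_jk at a hidden unit,
    w_ij += lr * Err_j * O_i,  theta_j += lr * Err_j."""
    rnd = random.Random(0)
    # Step 1: initialize weights and biases to small random numbers.
    w_h = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    th_h = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    th_o = rnd.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, t in data:
            # Step 2: propagate the inputs forward.
            O_h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + th)
                   for ws, th in zip(w_h, th_h)]
            O = sigmoid(sum(w * o for w, o in zip(w_o, O_h)) + th_o)
            # Step 3: backpropagate the error (hidden errors use the old w_o).
            err_o = O * (1 - O) * (t - O)
            err_h = [o * (1 - o) * err_o * w for o, w in zip(O_h, w_o)]
            w_o = [w + lr * err_o * o for w, o in zip(w_o, O_h)]
            th_o += lr * err_o
            for j in range(n_hidden):
                w_h[j] = [w + lr * err_h[j] * xi for w, xi in zip(w_h[j], x)]
                th_h[j] += lr * err_h[j]
    return w_h, th_h, w_o, th_o

# XOR: the classic target that a single-layer perceptron cannot fit.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params = train_backprop(data, n_in=2, n_hidden=2)
```

Here the terminating condition (step 4) is simply a fixed epoch count, to keep the sketch short; a real implementation would stop once the mean squared error falls below a threshold.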

Backpropagation and Interpretability

Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights. However, the number of epochs can be exponential in n, the number of inputs, in the worst case.
Rule extraction from networks (network pruning):
- Simplify the network structure by removing the weighted links that have the least effect on the trained network.
- Then perform link, unit, or activation-value clustering.
- The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden-unit layers.
Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented as rules.