Supervised Learning in Multilayer Networks


44.1 Multilayer perceptrons

No course on neural networks could be complete without a discussion of supervised multilayer networks, also known as backpropagation networks.

The multilayer perceptron is a feedforward network. It has input neurons, hidden neurons and output neurons. The hidden neurons may be arranged in a sequence of layers. The most common multilayer perceptrons have a single hidden layer, and are known as two-layer networks, the number two counting the number of layers of neurons not including the inputs.

Such a feedforward network defines a nonlinear parameterized mapping from an input x to an output y = y(x; w, A). The output is a continuous function of the input and of the parameters w; the architecture of the net, i.e., the functional form of the mapping, is denoted by A. Feedforward networks can be trained to perform regression and classification tasks.

Regression networks

In the case of a regression problem, the mapping for a network with one hidden layer may have the form:

    Hidden layer:  a_j^(1) = Σ_l w_jl^(1) x_l + θ_j^(1);   h_j = f^(1)(a_j^(1))          (44.1)

    Output layer:  a_i^(2) = Σ_j w_ij^(2) h_j + θ_i^(2);   y_i = f^(2)(a_i^(2))          (44.2)

where, for example, f^(1)(a) = tanh(a), and f^(2)(a) = a. Here l runs over the inputs x_1, ..., x_L, j runs over the hidden units, and i runs over the outputs. The weights w and biases θ together make up the parameter vector w. The nonlinear sigmoid function f^(1) at the hidden layer gives the neural network greater computational flexibility than a standard linear regression model. Graphically, we can represent the neural network as a set of layers of connected neurons (figure 44.1).

Figure 44.1. A typical two-layer network, with six inputs, seven hidden units, and three outputs. Each line represents one weight.
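As a concrete illustration of equations (44.1) and (44.2), here is a minimal numpy sketch of the forward pass of such a two-layer regression network (tanh hidden units, linear outputs). The function and variable names are my own, not from the text.

```python
import numpy as np

def two_layer_net(x, W1, theta1, W2, theta2):
    """Forward pass of the two-layer regression network of (44.1)-(44.2)."""
    a1 = W1 @ x + theta1   # hidden activations: a_j = sum_l w_jl x_l + theta_j
    h = np.tanh(a1)        # f^(1)(a) = tanh(a)
    a2 = W2 @ h + theta2   # output activations
    return a2              # f^(2)(a) = a: identity outputs for regression

# Example with six inputs, seven hidden units and three outputs, as in figure 44.1.
rng = np.random.default_rng(0)
W1, theta1 = rng.normal(size=(7, 6)), rng.normal(size=7)
W2, theta2 = rng.normal(size=(3, 7)), rng.normal(size=3)
y = two_layer_net(rng.normal(size=6), W1, theta1, W2, theta2)
```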

What sorts of functions can these networks implement?

Just as we explored the weight space of the single neuron in Chapter 39, examining the functions it could produce, let us explore the weight space of a multilayer network. In figures 44.2 and 44.3 I take a network with one input and one output and a large number H of hidden units, set the biases and weights θ^(1), w^(1), θ^(2) and w^(2) to random values, and plot the resulting function y(x). I set the hidden units' biases θ_j^(1) to random values from a Gaussian with zero mean and standard deviation σ_bias; the input-to-hidden weights w_jl^(1) to random values with standard deviation σ_in; and the bias and output weights θ^(2) and w_ij^(2) to random values with standard deviation σ_out.

Figure 44.2. Samples from the prior over functions of a one-input network. For each of a sequence of values of σ_bias = 8, 6, 4, 3, 2, 1.6, 1.2, 0.8, 0.4, 0.3, 0.2, with σ_in = 5σ_bias, one random function is shown. The other hyperparameters of the network were H = 400, σ_out = 0.05.

The sort of functions that we obtain depends on the values of σ_bias, σ_in and σ_out. As the weights and biases are made bigger we obtain more complex functions with more features and a greater sensitivity to the input variable. The vertical scale of a typical function produced by the network with random weights is of order √H σ_out; the horizontal range in which the function varies significantly is of order σ_bias/σ_in; and the shortest horizontal length scale is of order 1/σ_in.

Figure 44.3. Properties of a function produced by a random network. The function shown was produced by making a random network with H = 400 hidden units and Gaussian weights with σ_bias = 4, σ_in = 8, and σ_out = 0.5.

Radford Neal (1996) has also shown that in the limit as H → ∞ the statistical properties of the functions generated by randomizing the weights are independent of the number of hidden units; so, interestingly, the complexity of the functions becomes independent of the number of parameters in the model. What determines the complexity of the typical functions is the characteristic magnitude of the weights. Thus we anticipate that when we fit these models to real data, an important way of controlling the complexity of the fitted function will be to control the characteristic magnitude of the weights.

Figure 44.4 shows one typical function produced by a network with two inputs and one output. This should be contrasted with the function produced by a traditional linear regression model, which is a flat plane. Neural networks can create functions with more complexity than a linear regression.

Figure 44.4. One sample from the prior of a two-input network with {H, σ_in, σ_bias, σ_out} = {400, 8.0, 8.0, 0.5}.

44.2 How a regression network is traditionally trained

This network is trained using a data set D = {x^(n), t^(n)} by adjusting w so as to minimize an error function, e.g.,

    E_D(w) = (1/2) Σ_n Σ_i ( t_i^(n) − y_i(x^(n); w) )^2.          (44.3)

This objective function is a sum of terms, one for each input/target pair {x, t}, measuring how close the output y(x; w) is to the target t.

This minimization is based on repeated evaluation of the gradient of E_D. This gradient can be efficiently computed using the backpropagation algorithm (Rumelhart et al., 1986), which uses the chain rule to find the derivatives.
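The backpropagation computation for this architecture fits in a few lines. The following is a minimal illustrative sketch (my own naming, plain batch gradient descent with a fixed step size η), not the book's code.

```python
import numpy as np

def backprop_grads(x, t, W1, theta1, W2, theta2):
    """Gradients of the one-case error 1/2 * sum_i (t_i - y_i)^2 for the
    network of (44.1)-(44.2), computed via the chain rule (backpropagation)."""
    a1 = W1 @ x + theta1
    h = np.tanh(a1)
    y = W2 @ h + theta2                    # linear outputs
    delta2 = y - t                         # dE/da^(2)
    delta1 = (W2.T @ delta2) * (1 - h**2)  # dE/da^(1); tanh'(a) = 1 - tanh(a)^2
    return np.outer(delta1, x), delta1, np.outer(delta2, h), delta2

def train(X, T, W1, theta1, W2, theta2, eta=0.01, n_steps=1000):
    """Batch gradient descent on E_D(w) = 1/2 sum_n |t^(n) - y(x^(n); w)|^2 (44.3)."""
    params = [W1, theta1, W2, theta2]
    for _ in range(n_steps):
        grads = [np.zeros_like(p) for p in params]
        for x, t in zip(X, T):
            for g, dg in zip(grads, backprop_grads(x, t, *params)):
                g += dg
        for p, g in zip(params, grads):
            p -= eta * g                   # gradient descent step
    return params
```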

Often, regularization (also known as weight decay) is included, modifying the objective function to:

    M(w) = βE_D + αE_W          (44.4)

where, for example, E_W = (1/2) Σ_i w_i^2. This additional term favours small values of w and decreases the tendency of a model to overfit noise in the training data.

Rumelhart et al. (1986) showed that multilayer perceptrons can be trained, by gradient descent on M(w), to discover solutions to non-trivial problems such as deciding whether an image is symmetric or not. These networks have been successfully applied to real-world tasks as varied as pronouncing English text (Sejnowski and Rosenberg, 1987) and focussing multiple-mirror telescopes (Angel et al., 1990).

44.3 Neural network learning as inference

The neural network learning process above can be given the following probabilistic interpretation. [Here we repeat and generalize the discussion of Chapter 41.]

The error function is interpreted as defining a noise model. βE_D is the negative log likelihood:

    P(D | w, β, H) = (1/Z_D(β)) exp(−βE_D).          (44.5)

Thus, the use of the sum-squared error E_D (44.3) corresponds to an assumption of Gaussian noise on the target variables, and the parameter β defines a noise level σ_ν^2 = 1/β.

Similarly the regularizer is interpreted in terms of a log prior probability distribution over the parameters:

    P(w | α, H) = (1/Z_W(α)) exp(−αE_W).          (44.6)

If E_W is quadratic as defined above, then the corresponding prior distribution is a Gaussian with variance σ_W^2 = 1/α.

The probabilistic model H specifies the architecture A of the network, the likelihood (44.5), and the prior (44.6).

The objective function M(w) then corresponds to the inference of the parameters w, given the data:

    P(w | D, α, β, H) = P(D | w, β, H) P(w | α, H) / P(D | α, β, H)          (44.7)
                      = (1/Z_M) exp(−M(w)).          (44.8)

The w found by (locally) minimizing M(w) is then interpreted as the (locally) most probable parameter vector, w_MP.
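To make the correspondence between the regularized objective and the log posterior concrete, here is a small sketch of M(w) from equation (44.4); the function names and the generic `net` argument are my own. Its negative exponential is proportional to the posterior (44.8), so the minimizer of M is w_MP.

```python
import numpy as np

def E_D(w, X, T, net):
    """Sum-squared data error (44.3); net(x, w) is the network's output y(x; w)."""
    return 0.5 * sum(np.sum((t - net(x, w))**2) for x, t in zip(X, T))

def E_W(w):
    """Quadratic regularizer E_W = 1/2 * sum_i w_i^2."""
    return 0.5 * np.sum(np.asarray(w)**2)

def M(w, X, T, net, alpha, beta):
    """Regularized objective (44.4).  exp(-M(w)) is proportional to the posterior
    P(w | D, alpha, beta, H) of (44.8): beta = 1/sigma_nu^2 is the inverse noise
    variance and alpha = 1/sigma_W^2 the inverse prior variance."""
    return beta * E_D(w, X, T, net) + alpha * E_W(w)
```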

The interpretation of M(w) as a log probability adds little new at this stage. But new tools will emerge when we proceed to other inferences. First, though, let us establish the probabilistic interpretation of classification networks, to which the same tools apply.

Binary classification networks

If the targets t in a data set are binary classification labels (0, 1), it is natural to use a neural network whose output y(x; w, A) is bounded between 0 and 1, and is interpreted as a probability P(t=1 | x, w, A). For example, a network with one hidden layer could be described by the feedforward equations (44.1) and (44.2), with f^(2)(a) = 1/(1 + e^(−a)). The error function βE_D is replaced by the negative log likelihood:

    G(w) = −Σ_n [ t^(n) ln y(x^(n); w) + (1 − t^(n)) ln(1 − y(x^(n); w)) ].          (44.9)

The total objective function is then M = G + αE_W. Note that this includes no parameter β (because there is no Gaussian noise).

Multi-class classification networks

For a multi-class classification problem, we can represent the targets by a vector, t, in which a single element is set to 1, indicating the correct class, and all other elements are set to 0. In this case it is appropriate to use a softmax network having coupled outputs which sum to one and are interpreted as class probabilities y_i = P(t_i=1 | x, w, A). The last part of equation (44.2) is replaced by:

    y_i = e^(a_i) / Σ_i' e^(a_i').          (44.10)

The negative log likelihood in this case is

    G(w) = −Σ_n Σ_i t_i^(n) ln y_i(x^(n); w).          (44.11)

As in the case of the regression network, the minimization of the objective function M(w) = G + αE_W corresponds to an inference of the form (44.8). A variety of useful results can be built on this interpretation.
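A brief numpy sketch of these two likelihoods (again with my own function names): the logistic output and negative log likelihood (44.9) for binary targets, and the softmax output (44.10) with its cross-entropy (44.11) for one-of-K targets.

```python
import numpy as np

def sigmoid(a):
    """Logistic output f^(2)(a) = 1/(1 + exp(-a)), interpreted as P(t=1 | x, w, A)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Coupled outputs that sum to one (44.10); max subtracted for numerical stability."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def G_binary(Y, T):
    """Negative log likelihood (44.9); Y[n] = y(x^(n); w), T[n] in {0, 1}."""
    Y, T = np.asarray(Y), np.asarray(T)
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def G_multiclass(Y, T):
    """Negative log likelihood (44.11); rows of Y are softmax outputs,
    rows of T are one-of-K target vectors."""
    return -np.sum(np.asarray(T) * np.log(np.asarray(Y)))
```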

44.4 Benefits of the Bayesian approach to supervised feedforward neural networks

From the statistical perspective, supervised neural networks are nothing more than nonlinear curve-fitting devices. Curve fitting is not a trivial task, however. The effective complexity of an interpolating model is of crucial importance, as illustrated in figure 44.5.

Consider a control parameter that influences the complexity of a model, for example a regularization constant α (weight decay parameter). As the control parameter is varied to increase the complexity of the model (descending from figure 44.5a to c and going from left to right across figure 44.5d), the best fit to the training data that the model can achieve becomes increasingly good. However, the empirical performance of the model, the test error, first decreases then increases again. An over-complex model overfits the data and generalizes poorly. This problem may also complicate the choice of architecture in a multilayer perceptron, the radius of the basis functions in a radial basis function network, and the choice of the input variables themselves in any multidimensional regression problem. Finding values for model control parameters that are appropriate for the data is therefore an important and non-trivial problem.

Figure 44.5. Optimization of model complexity. Panels (a-c) show a radial basis function model interpolating a simple data set with one input variable and one output variable. As the regularization constant is varied to increase the complexity of the model (from (a) to (c)), the interpolant is able to fit the training data increasingly well, but beyond a certain point the generalization ability (test error) of the model deteriorates. Panel (d) shows the training error and test error as functions of the model control parameters; panel (e) shows the log probability of the training data given the control parameters. Probability theory allows us to optimize the control parameters without needing a test set.

The overfitting problem can be solved by using a Bayesian approach to control model complexity.

If we give a probabilistic interpretation to the model, then we can evaluate the evidence for alternative values of the control parameters. As was explained in Chapter 28, over-complex models turn out to be less probable, and the evidence P(Data | Control Parameters) can be used as an objective function for optimization of model control parameters (figure 44.5e). The setting of α that maximizes the evidence is displayed in figure 44.5b.

Bayesian optimization of model control parameters has four important advantages. (1) No test set or validation set is involved, so all available training data can be devoted to both model fitting and model comparison. (2) Regularization constants can be optimized on-line, i.e., simultaneously with the optimization of ordinary model parameters. (3) The Bayesian objective function is not noisy, in contrast to a cross-validation measure. (4) The gradient of the evidence with respect to the control parameters can be evaluated, making it possible to simultaneously optimize a large number of control parameters.

Probabilistic modelling also handles uncertainty in a natural manner. It offers a unique prescription, marginalization, for incorporating uncertainty about parameters into predictions; this procedure yields better predictions, as we saw in Chapter 41. Figure 44.6 shows error bars on the predictions of a trained neural network.

Figure 44.6. Error bars on the predictions of a trained regression network. The solid line gives the predictions of the best-fit parameters of a multilayer perceptron trained on the data points. The error bars (dotted lines) are those produced by the uncertainty of the parameters w. Notice that the error bars become larger where the data are sparse.

Implementation of Bayesian inference

As was mentioned in Chapter 41, Bayesian inference for multilayer networks may be implemented by Monte Carlo sampling, or by deterministic methods employing Gaussian approximations (Neal, 1996; MacKay, 1992c).
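As an illustration of what marginalization looks like in practice, here is a small sketch (my own, not from the text) that turns a set of parameter samples, assumed to have been drawn from the posterior P(w | D, α, β, H) by one of the Monte Carlo methods just mentioned, into a predictive mean and error bar at an input x.

```python
import numpy as np

def predictive_error_bars(x, w_samples, net):
    """Marginalize over parameter uncertainty: average the network's prediction
    over posterior samples {w^(s)} and report the spread, as in the error bars
    of figure 44.6.  net(x, w) is the network function y(x; w)."""
    ys = np.array([net(x, w) for w in w_samples])
    return ys.mean(axis=0), ys.std(axis=0)   # predictive mean and error bar
```

For a regression network with Gaussian output noise, the noise variance 1/β would be added to the squared error bar; the sketch above shows only the contribution from parameter uncertainty.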

Within the Bayesian framework for data modelling, it is easy to improve our probabilistic models. For example, if we believe that some input variables in a problem may be irrelevant to the predicted quantity, but we don't know which, we can define a new model with multiple hyperparameters that captures the idea of uncertain input variable relevance (MacKay, 1994b; Neal, 1996; MacKay, 1995b); these models then infer automatically from the data which are the relevant input variables for a problem.

44.5 Exercises

Exercise 44.1.[4] How to measure a classifier's quality. You've just written a new classification algorithm and want to measure how well it performs on a test set, and compare it with other classifiers. What performance measure should you use?

There are several standard answers. Let's assume the classifier gives an output y(x), where x is the input, which we won't discuss further, and that the true target value is t. In the simplest discussions of classifiers, both y and t are binary variables, but you might care to consider cases where y and t are more general objects also.

The most widely used measure of performance on a test set is the error rate: the fraction of misclassifications made by the classifier. This measure forces the classifier to give a 0/1 output and ignores any additional information that the classifier might be able to offer, for example an indication of the firmness of a prediction.

Unfortunately, the error rate does not necessarily measure how informative a classifier's output is. Consider frequency tables showing the joint frequency of the 0/1 output of a classifier (horizontal axis) and the true 0/1 variable t (vertical axis). The numbers that we'll show are percentages. The error rate e is the sum of the two off-diagonal numbers, which we could call the false positive rate e+ and the false negative rate e−. Of the following three classifiers, A and B have the same error rate of 10% and C has a greater error rate of 12%.

            Classifier A       Classifier B       Classifier C
             y=0   y=1          y=0   y=1          y=0   y=1
    t=0       90     0           80    10           78    12
    t=1       10     0            0    10            0    10

But clearly classifier A, which simply guesses that the outcome is 0 for all cases, is conveying no information at all about t; whereas classifier B has an informative output: if y = 0 then we are sure that t really is zero; and if y = 1 then there is a 50% chance that t = 1, as compared to the prior probability P(t = 1) = 0.1. Classifier C is slightly less informative than B, but is still much more useful than the information-free classifier A.

One way to improve on the error rate as a performance measure is to report the pair (e+, e−), the false positive error rate and the false negative error rate, which are (0, 0.1) and (0.1, 0) for classifiers A and B. It is especially important to distinguish between these two error probabilities in applications where the two sorts of error have different associated costs. However, there are a couple of problems with the 'error rate pair'.

    How common sense ranks the classifiers: (best) B > C > A (worst).
    How the error rate ranks the classifiers: (best) A = B > C (worst).

First, if I simply told you that classifier A has error rates (0, 0.1) and B has error rates (0.1, 0), it would not be immediately evident that classifier A is actually utterly worthless. Surely we should have a performance measure that gives the worst possible score to A!

Second, if we turn to a multiple-class classification problem such as digit recognition, then the number of types of error increases from two to 10 × 9 = 90, one for each possible confusion of class t with class t'. It would be nice to have some sensible way of collapsing these 90 numbers into a single rankable number that makes more sense than the error rate.

Another reason for not liking the error rate is that it doesn't give a classifier credit for accurately specifying its uncertainty. Consider classifiers that have three outputs available, 0, 1 and a rejection class, '?', which indicates that the classifier is not sure. Consider classifiers D and E with the following frequency tables, in percentages:

            Classifier D             Classifier E
             y=0   y=?   y=1          y=0   y=?   y=1
    t=0       74    10     6           78     6     6
    t=1        0     1     9            0     5     5

Both of these classifiers have (e+, e−, r) = (6%, 0%, 11%). But are they equally good classifiers? Compare classifier E with C. The two classifiers are equivalent. E is just C in disguise: we could make E by taking the output of C and tossing a coin when C says 1 in order to decide whether to give output 1 or '?'. So E is equal to C and thus inferior to B. Now compare D with B. Can you justify the suggestion that D is a more informative classifier than B, and thus is superior to E? Yet D and E have the same (e+, e−, r) scores.

People often plot error-reject curves (also known as ROC curves; ROC stands for 'receiver operating characteristic') which show the total error rate e = e+ + e− versus the rejection rate r as r is allowed to vary from 0 to 1, and use these curves to compare classifiers (figure 44.7). [In the special case of binary classification problems, e+ may be plotted versus e− instead.] But as we have seen, error rates can be undiscerning performance measures. Does plotting one error rate as a function of another make this weakness of error rates go away?

Figure 44.7. An error-reject curve (error rate versus rejection rate). Some people use the area under this curve as a measure of classifier quality.

For this exercise, either construct an explicit example demonstrating that the error-reject curve, and the area under it, are not necessarily good ways to compare classifiers; or prove that they are.

As a suggested alternative method for comparing classifiers, consider the mutual information between the output and the target,

    I(T; Y) ≡ H(T) − H(T | Y) = Σ_{y,t} P(y) P(t | y) log [ P(t | y) / P(t) ],          (44.12)

which measures how many bits the classifier's output conveys about the target. Evaluate the mutual information for classifiers A-E above.

Investigate this performance measure and discuss whether it is a useful one. Does it have practical drawbacks?
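A quick way to start on the exercise is to compute (44.12) directly from the joint frequency tables. The following numpy sketch (my own code, with the tables transcribed from above) prints the mutual information in bits for classifiers A-E.

```python
import numpy as np

def mutual_information(joint_percent):
    """I(T;Y) in bits (44.12), from a joint table with rows indexed by the
    target t and columns by the classifier output y (entries in percent)."""
    P = np.asarray(joint_percent, dtype=float)
    P = P / P.sum()                        # joint P(t, y)
    Pt = P.sum(axis=1, keepdims=True)      # marginal P(t)
    Py = P.sum(axis=0, keepdims=True)      # marginal P(y)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = P * np.log2(P / (Pt * Py))
    return np.nansum(terms)                # 0 log 0 terms contribute zero

# Rows: t = 0, 1.  Columns: y = 0, 1 (A-C) or y = 0, ?, 1 (D, E).
tables = {
    'A': [[90, 0], [10, 0]],
    'B': [[80, 10], [0, 10]],
    'C': [[78, 12], [0, 10]],
    'D': [[74, 10, 6], [0, 1, 9]],
    'E': [[78, 6, 6], [0, 5, 5]],
}
for name, table in tables.items():
    print(name, mutual_information(table))
```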