Danielle Maddix AA238 Final Project December 9, 2016


Structure and Parameter Learning in Bayesian Networks with Applications to Predicting Breast Cancer Tumor Malignancy in a Lower Dimension Feature Space

Abstract

Bayesian structure and parameter learning can be used to predict the probability that a breast tumor is malignant or benign. Bayesian structure learning is implemented using a K2 search over the space of directed acyclic graphs (DAGs) to obtain a Bayesian structure with a higher Bayesian score than that of the Bayesian network corresponding to the Naive Bayes model. We model the conditional probability distributions (CPDs) of the continuous observed features as Gaussians and the discrete class label as Bernoulli distributed. Parameter learning via maximum likelihood estimation (MLE) is used to learn the mean and variance for the continuous CPDs and the observed counts for the discrete probabilities. A key advantage of this method is that it can be used for feature selection to reduce the dimensionality of the space: d-separation and the chain rule are used to eliminate variables that are conditionally independent of the class. Various metrics are used to compare the methods, in which uncertain probabilities are penalized less than outright misclassifications, and a reward function is defined to see which methods produce more false negatives, the more dangerous outcome in this application. Precision-recall and ROC curves are also utilized to compare the methods and to show the effect of rounding these probabilities at various thresholds.

I. INTRODUCTION

A. Problem Description

Decision making under uncertainty has a wide variety of applications in scientific computing. One particular application in the biological sciences, investigated here, is predicting whether a breast tumor is malignant or benign. Even though predictive medical procedures are available for diagnosis, a single test may not be definitive and carries error margins of either false positives or false negatives. An algorithm which could take many test diagnostics into account and make a prediction has the potential for broad impact in the medical field. In fact, the medical literature is already becoming rich in such methods, as seen in [1], [2], [3], [4], with the potential goal of patients having to undergo fewer extensive tests. It is clear that some features are very predictive, such as the size of the tumor, but considering this feature alone cannot give definitive results.

The comprehensive dataset utilized is the Breast Cancer Wisconsin (Diagnostic) Dataset from the UC Irvine Machine Learning Repository [5]. The dataset is fairly rich in examples, with m = 569 patients. It consists of a matrix with 32 columns, where each row is a patient sample. For a given row, the first column is the patient ID, which is ignored in this study, and the second column is the class label: M for the malignant class, c = 1, and B for the benign class, c = 0. The remaining n = 30 columns form the observation vector o in R^n. There are ten distinct continuous observations, namely the radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension [5]. For each of these, the mean, standard error and worst case measurements are reported: the first ten columns correspond to the means, columns 11-20 to the standard errors and columns 21-30 to the worst case measurements. The class distribution is 357 benign samples (62.7%) and 212 malignant samples (37.3%).
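As a concrete starting point, the following is a minimal MATLAB sketch of loading this dataset and splitting it into the class labels and the 30-column observation matrix. It assumes the comma-separated UCI file wdbc.data is in the working directory; the file name and variable names are illustrative and not part of the original study's code.

% Minimal sketch: load the Wisconsin Diagnostic dataset (assumed to be the
% comma-separated UCI file wdbc.data) and split it into the binary class
% label c (1 = malignant, 0 = benign) and the observation matrix o.
T = readtable('wdbc.data', 'FileType', 'text', 'ReadVariableNames', false);

c = double(strcmp(T{:, 2}, 'M'));   % column 2: diagnosis, 'M' or 'B'
o = T{:, 3:32};                     % columns 3-32: the n = 30 features

o_mean  = o(:, 1:10);    % mean measurements
o_se    = o(:, 11:20);   % standard error measurements
o_worst = o(:, 21:30);   % worst case measurements

fprintf('m = %d patients, %d malignant, %d benign\n', ...
        numel(c), sum(c == 1), sum(c == 0));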
The output is the probability that the tumor is malignant given the input data, that is, p(c_i^1 | o_i) for patient i. By analyzing this dataset, we would like to determine the subset of the observed features that are the most relevant for predicting this probability.

B. Prior Work

Past work in [1], [2], [3], [4] on this dataset has treated machine learning classification problems, where instead of outputting a probability expressing the uncertainty of the prediction, a label is assigned to the tumor being malignant or not. This problem reduces to computing the decision boundary between the malignant and benign sets. It was shown in [5] that the sets are linearly separable using all 30 input observed features. Moreover, the best predictive accuracy, as measured by k-fold cross validation (CV) with k = 10, was obtained using one separating plane in the 3-dimensional feature space of worst area, worst smoothness and mean texture. This plane was obtained using a common optimization method, linear programming (LP), as described in [6]. In particular, the separating plane was computed using the Multisurface Method-Tree (MSM-T), a classification method that solves an LP to construct a decision tree [7]. The relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes [5].

In [8], machine learning classification algorithms were implemented to compute linear decision boundaries, based on the prior literature's finding that the data is linearly separable. The supervised learning algorithms tested were logistic regression, linear and quadratic Gaussian Discriminant Analysis (GDA) and Support Vector Machines (SVM). Logistic regression is a discriminative learning algorithm that directly models the conditional probability p(c | o) using a logistic function. The fitting parameter θ in R^{n+1}, including the intercept term, is computed via MLE, and an optimization algorithm is then used to find the optimal θ. In GDA, a model is built both for the malignant tumors, p(o | c^1), and for the benign tumors, p(o | c^0). It then computes p(c^1 | o) using Bayes' rule, namely

p(c^1 | o) = \frac{p(o | c^1) p(c^1)}{p(o | c^0) p(c^0) + p(o | c^1) p(c^1)}.    (1)

In linear GDA, the class-conditional densities are assumed to be multivariate Gaussians with means µ_0, µ_1 in R^n and a shared covariance matrix Σ in R^{n×n}, calculated from MLE counts. The prior p(c) is assumed to be Bernoulli distributed. In the quadratic case, there are two separate covariance matrices Σ_0 and Σ_1. Since the denominator is the same for both p(c^1 | o) and p(c^0 | o), GDA assigns a positive malignant label of 1 if the numerator of p(c^1 | o) is larger than the numerator of p(c^0 | o). Lastly, SVM solves a convex optimization problem to maximize the distance between the points and the decision boundary. Maximizing this distance is desirable, since the points near the decision boundary represent a higher level of uncertainty. In that study, the k-fold CV error is also computed with k = 10, and linear GDA achieves the lowest error of 4.46%, using a large number of features, 29. Furthermore, the error, according to the various metrics, namely hold-out CV and k-fold CV, decreases as the number of observed features increases up to some upper bound. The absolute error counts the number of misclassifications, namely

\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} |c_i - \hat{c}_i|,    (2)

where c_i is the exact label, \hat{c}_i is the predicted label and m_{test} is the test set size. A key limitation of these methods is that they are forced to make a classification, even if there is high uncertainty in the probability. This can lead to a large number of false negatives or false positives. Moreover, it shows that a high dimensional feature space is required for the best results.

II. METHODS

The goal of this work is to identify the important features in order to reduce the problem from a high-dimensional feature space to a smaller one. To do so, structure learning is implemented to learn an optimal Bayesian network, which is compared to the Bayesian network from Naive Bayes. MLE is utilized to estimate the parameters of the network, comparing continuous Gaussian to discrete CPDs. The output of this method is the probability of the tumor being malignant, given the observations connected to the class, c, in the Bayesian network. Let o in R^n denote the n observations and õ in R^ñ contain the components of o which are not conditionally independent of c, as inferred from the Bayesian network. Then, using the definition of conditional probability, the desired probability is given by

p(c | o_{1:n}) = p(c | õ_{1:ñ}) = \frac{p(c, õ_{1:ñ})}{p(õ_{1:ñ})}.    (3)

We use our inference methods to infer the joint distribution using the chain rule for Bayesian networks, namely p(x_1, ..., x_n) = \prod_{i=1}^{n} p(x_i | pa_{x_i}), where pa_{x_i} are the parents of x_i in the Bayesian network [9]. By the law of total probability, the denominator is simply \sum_{c=0}^{1} p(c, õ_{1:ñ}).
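For reference, the following is a minimal MATLAB sketch of the linear GDA posterior of Equation (1) used in the prior work [8]; it is not that work's code. The function names gda_posterior and mvn_pdf are illustrative, the density is evaluated directly to avoid toolbox dependencies, and implicit expansion (MATLAB R2016b or later) is assumed.

function p1 = gda_posterior(o_train, c_train, o_new)
% Linear GDA as in Equation (1): class-conditional multivariate Gaussians
% with a shared covariance matrix fit by MLE, Bernoulli prior on the class.
phi = mean(c_train);                        % p(c = 1)
mu0 = mean(o_train(c_train == 0, :), 1);
mu1 = mean(o_train(c_train == 1, :), 1);

resid = o_train;                            % center each sample at its class mean
resid(c_train == 0, :) = resid(c_train == 0, :) - mu0;
resid(c_train == 1, :) = resid(c_train == 1, :) - mu1;
Sigma = (resid' * resid) / size(o_train, 1);

lik0 = mvn_pdf(o_new, mu0, Sigma);          % p(o | c = 0)
lik1 = mvn_pdf(o_new, mu1, Sigma);          % p(o | c = 1)
p1 = lik1 * phi ./ (lik0 * (1 - phi) + lik1 * phi);    % Equation (1)
end

function p = mvn_pdf(x, mu, Sigma)
% Multivariate Gaussian density evaluated at each row of x.
d = size(x, 2);
z = x - mu;
p = exp(-0.5 * sum((z / Sigma) .* z, 2)) / sqrt((2*pi)^d * det(Sigma));
end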
Thus p(c, õ_{1:ñ}) is the only probability that needs to be computed using the chain rule, and we can then simply normalize.

A. Naive Bayes with Gaussian CPDs

In this first method, we assume that the Bayesian structure is known and follows the Naive Bayes model. The Naive Bayes assumption is that (o_i ⊥ o_j | c) for all i ≠ j. Thus, the only edges in this simplified Bayesian network are from c to each of the observed features, since the observations are conditionally independent of each other given c. The Bayesian network encoding the Naive Bayes assumption for the first 10 mean observed features is shown below.

Fig. 1: Bayesian network for Naive Bayes over the mean features (c has an edge to each of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension).

Since c has no parents in this model and each observation o_i has c as its only parent, using the chain rule we obtain

p(c, õ_{1:ñ}) = p(c) \prod_{i=1}^{ñ} p(õ_i | c).    (4)

Thus, our desired probability is as follows:

p(c^1 | o_{1:n}) = \frac{\prod_{i=1}^{ñ} p(õ_i | c^1) p(c^1)}{\prod_{i=1}^{ñ} p(õ_i | c^0) p(c^0) + \prod_{i=1}^{ñ} p(õ_i | c^1) p(c^1)}.    (5)

Since we are assuming a specific Bayesian network in this case, only the parameter learning step is needed to compute these probabilities. Since c is binary, it can only take on 2 discrete values, and so it can be estimated by one independent parameter θ = p(c^1), where p(c^0) = 1 − p(c^1). The parameters are computed using a training set of size m_train. From MLE parameter learning,

p(c^1) = \frac{1}{m_{train}} \sum_{i=1}^{m_{train}} 1(c_i^1),    (6)

where 1 is the indicator function, so 1(c_i^1) = 1 if c_i = 1 and 0 otherwise. Thus, this simply reduces to counting the number of malignant samples in the training set and dividing by the total number of training samples [9]. The next step is to determine an appropriate model for p(õ_i | c), for each i = 1 : ñ. From the success of GDA in [8],

it is clear that modeling the CPDs of the continuous features as Gaussians is a good approximation. Thus, we compute the MLE parameters of a univariate Gaussian, namely \hat{µ}_1, \hat{σ}_1^2 and \hat{µ}_0, \hat{σ}_0^2, on the subsets of the data containing the malignant and benign samples, respectively. Let m_ben = m_train − m_mal. This results in the standard statistical estimates below:

\hat{µ}_0 = \frac{1}{m_{ben}} \sum_{i=1}^{m_{train}} õ_i 1(c_i^0),  \hat{σ}_0^2 = \frac{1}{m_{ben}} \sum_{i=1}^{m_{train}} (õ_i − \hat{µ}_0)^2 1(c_i^0),
\hat{µ}_1 = \frac{1}{m_{mal}} \sum_{i=1}^{m_{train}} õ_i 1(c_i^1),  \hat{σ}_1^2 = \frac{1}{m_{mal}} \sum_{i=1}^{m_{train}} (õ_i − \hat{µ}_1)^2 1(c_i^1).    (7)

These formulas are similar to Equation (1); the major difference is that GDA uses MLE to compute multivariate Gaussian parameters, whereas here we use MLE for a product of univariate Gaussians. Moreover, since the output is a probability, the denominator is computed in this case. After using the training data to learn the parameters, we loop over the test data to predict the probability of malignancy of a new patient's tumor given the observations. Thus, for a patient with label c and observation õ,

p(õ_i | c^j) = \frac{1}{\sqrt{2π \hat{σ}_{ij}^2}} \exp\left( −\frac{(õ_i − \hat{µ}_{ij})^2}{2 \hat{σ}_{ij}^2} \right),  j ∈ {0, 1}.

Equation (5) is then utilized for the final probability.

B. Bayesian Network with Gaussian CPDs for mean features

We no longer assume that the Bayesian network is known and follows a Naive Bayes model. Instead, we use structure learning to learn a Bayesian network with a higher Bayesian score than the Bayesian score corresponding to the Bayesian network for Naive Bayes. We first consider the case of finding an optimal Bayesian structure for the first 10 mean features. Since these observed features are continuous, the first step is to discretize them into a specified number of bins, with the bin width inversely proportional to the number of bins, namely (max(õ_i) − min(õ_i)) / n_bins. This was done using an in-house MATLAB code, but could also be done using the Discretizers.jl library in Julia. For these experiments, the optimal number of bins was 20. Using the same Gaussian model for the CPDs and Bernoulli prior for p(c), we now compute a Bayesian structure with a higher Bayesian score than that of the Bayesian network shown in Figure 1 for Naive Bayes. This can be done using the K2GraphSearch in BayesNets.jl to search over the space of DAGs for a graph with a higher Bayesian score than the previous graph, with the appropriate conditional probability distributions specified for each node. Since the output depends on the ordering of the nodes given, we loop over 1000 random orderings of the nodes and store the Bayesian network with the highest Bayesian score, shown below.

Fig. 2: Bayesian network with Gaussian CPDs for the mean features (nodes: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension).

This Bayesian network can be used to determine the relevant observed features by taking the subset of the observations with edges connected to c. Using this structure, we can perform inference using the conditional independence assumptions that it encodes. Due to d-separation, several features can be eliminated from the model, reducing the dimensionality of the feature space. Since the path contains a chain of nodes, it is evident that given o_4, area, c is conditionally independent of o_1, radius, and o_3, perimeter. Similarly, given o_7, concave points, c is conditionally independent of o_6, compactness, and o_8, concavity. Furthermore, o_10, fractal dimension, is unconnected and so is also conditionally independent of c. Thus, our model reduces to p(c | o_{1:10}) = p(c | õ_{1:5}), where õ = (o_2, o_4, o_5, o_7, o_9)^T.
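The following is a minimal MATLAB sketch of this Gaussian Naive Bayes computation, Equations (4)-(7). The function name naive_bayes_gauss and the variable names are illustrative; the input matrix is assumed to contain either all features (plain Naive Bayes) or only the reduced set õ selected by d-separation.

function p1 = naive_bayes_gauss(o_train, c_train, o_new)
% Gaussian Naive Bayes, Equations (4)-(7): univariate Gaussian CPDs per
% feature fit by MLE within each class, Bernoulli prior p(c), posterior (5).
mal = (c_train == 1);
ben = ~mal;
prior1 = mean(mal);                                    % Equation (6)

mu0 = mean(o_train(ben, :), 1);  s0 = var(o_train(ben, :), 1, 1);   % Equation (7)
mu1 = mean(o_train(mal, :), 1);  s1 = var(o_train(mal, :), 1, 1);

% log p(o | c) as a sum over the conditionally independent features
ll0 = sum(-0.5 * log(2 * pi * s0) - (o_new - mu0).^2 ./ (2 * s0), 2);
ll1 = sum(-0.5 * log(2 * pi * s1) - (o_new - mu1).^2 ./ (2 * s1), 2);

num0 = exp(ll0) * (1 - prior1);
num1 = exp(ll1) * prior1;
p1 = num1 ./ (num0 + num1);                            % Equation (5)
end

In the paper's experiments, this computation is applied either to all observed features or to the reduced set õ = (o_2, o_4, o_5, o_7, o_9) identified above.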
Since this subgraph follows the Naive Bayes assumption on õ, Equations (4) and (5) hold and the implementation is the same as in the prior subsection.

C. Bayesian Network with discrete CPDs for mean features

In this approach, we compute the Bayesian network assuming that the CPDs are discrete. As in the prior subsection, the first step is to discretize the continuous observations into 20 bins. We assume that each observation remains discrete and can take integer values from 1 to 20. For this we use an in-house MATLAB code to find the optimal Bayesian network, given discrete distributions. We begin with a graph containing the nodes (c, o_{1:10}) over the first 10 mean features, to compare to the method in the prior subsection. Given a random ordering of the nodes as input, we loop over the predecessors of each node and add the parent that maximizes the Bayesian score [9]. We loop over 1000 random orderings of the nodes and return the DAG with the highest Bayesian score. The computed Bayesian network, with a Bayesian score of -12,111.62 for this discrete model, is shown below.

Fig. 3: Bayesian network with discrete CPDs for the mean features (nodes: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension).

Note that this is larger than the -13,473 Bayesian score of the discrete CPD model of the Bayesian network from Naive Bayes. Using the d-separation properties, it is evident that c is conditionally independent of every node except concavity, texture, area, smoothness and symmetry, reducing the dimensionality from 10 to 5. So we have p(c | o_{1:10}) = p(c | õ), where õ = (o_4, o_9, o_2, o_5, o_8)^T. Using the chain rule, we get

p(c, õ_{1:ñ}) = \left[ \prod_{i=1}^{4} p(õ_i | c) \right] p(c | õ_5) p(õ_5).    (8)
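A minimal stand-in for the discretization step follows (the paper used an in-house MATLAB code; this sketch simply implements equal-width binning with width (max − min)/n_bins, and the function name is illustrative).

function o_disc = discretize_equal_width(o, n_bins)
% Equal-width discretization: map each continuous column of o to integer
% bin labels 1..n_bins, with bin width (max(o_i) - min(o_i)) / n_bins.
o_disc = zeros(size(o));
for i = 1:size(o, 2)
    lo = min(o(:, i));
    width = (max(o(:, i)) - lo) / n_bins;
    k = floor((o(:, i) - lo) / width) + 1;
    k(k > n_bins) = n_bins;       % the maximum value falls in the last bin
    o_disc(:, i) = k;
end
end

% Example: o_disc = discretize_equal_width(o(:, 1:10), 20);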

Thus, our desired probability p(c^1 | o_{1:n}) is as follows:

\frac{\prod_{i=1}^{4} p(õ_i | c^1) p(c^1 | õ_5)}{\prod_{i=1}^{4} p(õ_i | c^0) p(c^0 | õ_5) + \prod_{i=1}^{4} p(õ_i | c^1) p(c^1 | õ_5)}.

Note that p(õ_5) cancels from the numerator and denominator, since it does not depend on c. To perform inference, we must first learn the parameters of the network. For each i, we must compute p(õ_i | c). Since c is binary and õ_i can take on 20 values, this can be represented by 38 independent parameters, using the fact that probabilities sum to 1. The MLE parameters \hat{θ}_k = p(õ_i^k | c) are simply the observed counts in the data where õ_i = k, restricted to the malignant and benign samples respectively, as given below:

p(õ_i^k | c^1) = \frac{\sum_{i=1}^{m_{train}} 1(õ_i^k, c_i^1)}{\sum_{i=1}^{m_{train}} 1(c_i^1)} = \frac{\sum_{i=1}^{m_{train}} 1(õ_i^k, c_i^1)}{m_{mal}},    (9)

p(õ_i^k | c^0) = \frac{\sum_{i=1}^{m_{train}} 1(õ_i^k, c_i^0)}{\sum_{i=1}^{m_{train}} 1(c_i^0)} = \frac{\sum_{i=1}^{m_{train}} 1(õ_i^k, c_i^0)}{m_{ben}}.    (10)

Lastly, to calculate p(c | õ_5), we have 20 independent parameters, since c is binary and õ_5 can take on 20 discrete values. So,

p(c^1 | õ_5^k) = \frac{\sum_{i=1}^{m_{train}} 1(õ_{5,i}^k, c_i^1)}{\sum_{i=1}^{m_{train}} 1(õ_{5,i}^k)}.    (11)

Now that we have the necessary parameters to compute the final probability for new patients, we loop over the discretized test data and calculate the above probabilities for the discrete values of õ.

D. Bayesian Network with Gaussian CPDs for all features

Since the Gaussian CPDs performed better than the discrete model, as seen in the Results section, we compute an optimal Bayesian structure using all 30 features, assuming the Gaussian model. The computed Bayesian network is given below, displaying only the nodes not conditionally independent of c.

Fig. 4: Bayesian network computed using all 30 features (nodes connected to c: worst area, worst smoothness, symmetry std, perimeter std, concave points, texture, worst symmetry).

Worst area, worst smoothness and mean texture were selected, which were also the features cited as obtaining the best results in [5]. This resulting Bayesian network also follows the Naive Bayes assumption that the observations are conditionally independent of each other. Thus, Equations (4) and (5) also hold for õ = (o_24, o_25, o_9, o_13, o_17, o_2, o_29)^T, and the implementation follows from above. This is a new contribution relative to the work in [8], since that work only looked at subsets of features stored as consecutive elements in the feature vector, whereas here we have the key features selected from each category.

III. RESULTS AND DISCUSSION

We define the following metric to compare the various methods, namely

\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left( p(c_i^1 | õ_i) − c_i \right)^2.    (12)

Due to the square, probabilities are penalized less than the maximum error incurred by outright misclassifications, that is, predicting 1 when c_i = 0 or 0 when c_i = 1. For the Naive Bayes method, we investigate the optimal number of features to obtain the lowest metric. To do so, the k-fold CV is computed, which holds out m/k of the data each time for testing and trains on the remaining portion. The result is the average of these k = 10 error metrics. As expected, and similar to the findings in [8], this testing error decreases as the number of features is increased from 20 to 25 to a minimum of approximately 0.06, and then begins to increase again. We would like to avoid using such a large feature dimension space to obtain the minimum metric.

Fig. 5: Naive Bayes k-fold CV error for various numbers of features.

The results for the various methods are summarized below:

TABLE I: Probability Metric Results

                  NB 10    Gaussian 10   Discrete 10   Gaussian 30
Training Error    0.0743   0.0559        0.0404        0.0456
k-fold CV         0.0781   0.0625        0.0803        0.0520

For the first three methods, we only consider the first 10 mean features; in the final method, we consider all 30 features.
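A minimal MATLAB sketch of the k-fold evaluation of Equation (12) follows; the function name, the random fold assignment and the predictor handle (e.g., the naive_bayes_gauss sketch above) are illustrative, not the paper's exact protocol.

function err = kfold_prob_metric(o, c, k, predict_fn)
% k-fold cross validation of the probability metric, Equation (12):
% the mean over folds of mean_i (p(c_i^1 | o_i) - c_i)^2 on held-out data.
m = numel(c);
fold = mod(randperm(m)', k) + 1;        % random assignment to folds 1..k
errs = zeros(k, 1);
for j = 1:k
    test = (fold == j);
    train = ~test;
    p1 = predict_fn(o(train, :), c(train), o(test, :));
    errs(j) = mean((p1 - c(test)).^2);
end
err = mean(errs);
end

% Example: err = kfold_prob_metric(o_tilde, c, 10, @naive_bayes_gauss);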
The first row is the training error, which comes from training on the entire dataset and then using this training set as the test set. The training error should clearly be lower than the k-fold CV error in every case. The k-fold CV is more informative about the effectiveness of a method because it shows the method's ability to generalize, which is the desirable property that, given a subset of the dataset, it can properly diagnose new patients. The discrete CPD method using the first 10 features to calculate the Bayesian network has the lowest training error,

but the highest k-fold CV test error. This illustrates one of the negative effects of using MLE parameter learning with a limited dataset. If the training data contain no occurrence of õ_i = k together with the given class value, the method falsely assigns a probability of 0. This increases the number of false negatives, as indicated by the lower recall for this method in Table II. The full training set, on the other hand, is representative of each observation taking each of the discrete values. A Bayesian parameter learning approach with Beta and Dirichlet distributions to represent these counts from the dataset [9] may lead to improvement. With Naive Bayes we need a large number of features, 29 to be precise, to obtain a k-fold CV error of approximately 0.06, whereas with structure learning of the Bayesian network over c and the observed features, only 7 key features result in a k-fold CV error of 0.0520.

Another key component of this research is determining how to choose the appropriate class, using the probability as well as other factors specific to this application. Thus, we need to round the probabilities according to some threshold parameter, ε. If p(c^1 | õ) ≥ ε, we diagnose a malignant tumor with label ĉ = 1, and if p(c^1 | õ) < ε, we diagnose a benign tumor with label ĉ = 0. The simplest ε to choose as a first attempt is 0.5. Using the k-fold CV and rounding to compute the labels, we calculate the precision, which is a measure of the number of false positives, the recall, which is a measure of the number of false negatives, and the specificity. Precision depends on the computed class label and is defined as p(c^1 | ĉ^1) = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives. Recall and specificity depend on the exact label and are defined as p(ĉ^1 | c^1) = TP / (TP + FN) and p(ĉ^0 | c^0) = TN / (FP + TN), respectively, where FN is the number of false negatives and TN is the number of true negatives.

TABLE II: Precision, Recall and Specificity for ε = 0.5

              NB 10    Gaussian 10   Discrete 10   Gaussian 30
Precision     87.16%   93.18%        90.60%        91.16%
Recall        87.81%   87.35%        83.71%        88.37%
Specificity   94.69%   97.10%        96.04%        96.67%

It is evident that the Bayesian network over all 30 observations with Gaussian CPDs has the highest recall. We now investigate different values of ε to increase the recall, since the decision should take into account the consequences of a misclassification, rather than just rounding up or down [9]. In this case, a false negative has higher consequences than a false positive, because diagnosing the tumor as benign when it is really malignant could be deadly, whereas for a false positive more tests may be required to confirm. In the figure below, we vary ε from 0.001 to 0.999, in equally spaced intervals of 0.001, and plot the precision versus the recall for each ε. The small values of ε correspond to the highest recall and lowest precision. Thus, as ε decreases from 0.999 to 0.001, the precision decreases with respect to the recall. This makes sense, since a large ε implies that we label more samples benign and so we do not have as many false positives, and a small ε means that we label more samples malignant and so we do not have as many false negatives.

Fig. 6: Precision-Recall curve for various ε.

From this curve, we see that the Bayesian network computed using Gaussian CPDs with all 30 features is clearly the preferable method, since for the same recall it has higher precision. For example, if we require a recall of at least 90% using this method, from the plot we can compute the desired tolerance with the maximum precision: ε = 0.32 corresponds to 90.82% recall and 89.83% precision. If we require a higher recall of at least 95%, then we round up to malignant for every probability greater than ε = 0.0650. The corresponding precision is 83.96% for a recall of 95.28%. Since the precision does not drop drastically with such a low tolerance, this indicates another key feature of this method: it does not contain many uncertain probabilities around 0.5. Thus, many of the values are near the extremes of 0 and 1, unaffected by this rounding.

Fig. 7: ROC curve for various ε.
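A minimal MATLAB sketch of the threshold sweep behind Figures 6 and 7 follows; the variable names are illustrative, and p1 is assumed to hold the cross-validated probabilities p(c^1 | õ) with c the true labels.

% Sweep the rounding threshold epsilon and compute precision, recall and
% specificity at each value, as used for the precision-recall and ROC curves.
eps_grid = 0.001:0.001:0.999;
prec = zeros(size(eps_grid));  rec = prec;  spec = prec;
for t = 1:numel(eps_grid)
    chat = double(p1 >= eps_grid(t));      % diagnose malignant if p >= epsilon
    TP = sum(chat == 1 & c == 1);  FP = sum(chat == 1 & c == 0);
    FN = sum(chat == 0 & c == 1);  TN = sum(chat == 0 & c == 0);
    prec(t) = TP / (TP + FP);              % p(c^1 | chat^1)
    rec(t)  = TP / (TP + FN);              % p(chat^1 | c^1)
    spec(t) = TN / (TN + FP);              % p(chat^0 | c^0)
end
figure; plot(rec, prec);     xlabel('recall');          ylabel('precision');  % Fig. 6
figure; plot(1 - spec, rec); xlabel('1 - specificity'); ylabel('recall');     % Fig. 7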

The ROC curve above is another useful curve for measuring the comparative effectiveness of each method for classification. It plots the recall, that is, the true positive rate, versus 1 − specificity, that is, the false positive rate. It is evident again that the Bayesian network computed from the 30 features with Gaussian CPDs performs the best, since it has the steepest curve, representing a high true positive rate with a low false positive rate. The plot also shows that all the methods form good separators, since a random classifier would be given by the line connecting (0,0) and (1,1).

Lastly, we consider a reward system, which assigns a reward of −1 to a false negative and −λ, for 0 ≤ λ ≤ 1, to a false positive. We vary λ and compute the plots for the various methods. We can define the following reward metric,

−\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left( c_i (1 − p(c_i^1 | õ_i))^2 + λ (1 − c_i) p(c_i^1 | õ_i)^2 \right).    (13)

It is again clear that the Bayesian network with Gaussian CPDs over all 30 features has the largest reward regardless of the value of λ. The plot also shows that the initial gap between Naive Bayes with 10 features and the Bayesian network with Gaussian CPDs over 10 features is much smaller than the final gap, revealing that Naive Bayes has a smaller number of misclassified negatives than misclassified positives. On the contrary, the gap between the discrete Bayesian network and Naive Bayes is large initially, indicating that the discrete Bayesian network has a larger number of false negatives than false positives, due to the small gap at λ = 1. This was explained above by the MLE discrete parameter learning and the lack of data for every bin. Note that for λ = 1, we get the negative of the k-fold CV testing error from Table I.

Fig. 8: λ reward metric curve.

IV. CONCLUSION

The benefits of returning a probability, rather than a classification, are clear, because it identifies the predictions with the highest uncertainties. Physicians can utilize these probabilities to only diagnose the disease if it is within some confidence level, and to recommend further testing for probabilities around 0.5. Various rounding and reward techniques can be investigated to determine a threshold that achieves a specific recall or precision. It is clear from the results that Gaussians are good approximations for the conditional probability distributions of the continuous observed features. We see that utilizing structure learning can greatly improve upon the Bayesian structure corresponding to the Naive Bayes assumption, by reducing both the dimensionality of the problem and the error metric. This study also shows that using the mean features alone is not enough, since the worst case and standard error measurements are also valuable, as displayed by the better performance of the structure and parameter learning using all 30 features rather than just the first 10. It is also clear that a discrete model with MLE estimates does not generalize as well to test data containing values of the discretized variables that appear in the test set but not in the training data. Future work consists of experimenting with different conditional probability distributions, such as in hybrid Bayesian networks, where the variables are a mix of discrete (the class c) and continuous (the observed features) variables. Potential distributions include logit and probit models. It is evident that there are promising results in applying decision making under uncertainty and machine learning to the realm of cancer diagnosis, for potential use in collaboration with established medical tests and to help avoid invasive diagnostic tests on patients.
This is yet another example of a promising connection between the scientific computing and medical fields.

REFERENCES

[1] W. W.H., S. W.N., and M. O.L., "Machine learning techniques to diagnose breast cancer from fine-needle aspirates," Cancer Letters, vol. 77, pp. 163-171, 1994.
[2] M. O.L., W. W.H., and S. W.N., "Image analysis and machine learning applied to breast cancer diagnosis and prognosis," Analytical and Quantitative Cytology and Histology, vol. 17, pp. 77-87, 1995.
[3] W. W.H., S. W.N., H. D.M., and M. O.L., "Computerized breast cancer diagnosis and prognosis from fine needle aspirates," Archives of Surgery, vol. 130, pp. 511-516, 1995.
[4] M. O.L., W. W.H., S. W.N., and H. D.M., "Computer-derived nuclear features distinguish malignant from benign breast cytology," Human Pathology, vol. 26, pp. 792-796, 1995.
[5] UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems, "Breast cancer wisconsin (diagnostic) data set," https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic), 1996.
[6] B. K.P. and M. O.L., "Robust linear programming discrimination of two linearly inseparable sets," Optimization Methods and Software, vol. 1, pp. 23-34, 1992.
[7] B. K.P., "Decision tree construction via linear programming," Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992.
[8] M. D., "Diagnosing malignant versus benign breast tumors via machine learning techniques in high dimensions," http://cs229.stanford.edu/projects2014.html, 2014.
[9] K. M. J., Decision Making Under Uncertainty: Theory and Application. Cambridge, Massachusetts and London, England: MIT Press, 2015.