Gene Selection for Colon Cancer Classification using Bayesian Model Averaging of Linear and Quadratic Discriminants

Size: px

Start display at page:

Download "Gene Selection for Colon Cancer Classification using Bayesian Model Averaging of Linear and Quadratic Discriminants"

Reginald Mathews
5 years ago
Views:

1 Gene Selection for Colon Cancer Classification using Bayesian Model Averaging of Linear and Quadratic Discriinants Oyebayo Ridwan Olaniran*, Mohd Asrul Affendi Abdullah Departent of Matheatics and Statistics, Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Pagoh Educational Hub, Pagoh, Johor, Malaysia. Received 30 Septeber 2017; accepted 30 Noveber 2017; available online 28 Deceber 2017 Abstract: Recent findings reveal that various cancer types can be diagnosed using non-clinical approach which involves onitoring of the biological saples using their genes expression profiles. Two of the widely used ethods are Linear and Quadratic discriinant analyses. In this paper, the behaviours of Linear and Quadratic Discriinants Analyses (LDA and QDA) were observed within the fraework of Bayesian odel averaging. We applied Bayesian odel averaging to tackle odel uncertain proble inherent in discriinant analysis. We calibrated the developed classifier on published real life icroarray colon cancer data that contained 2000 gene expression profiles easured on 62 biological saples that coprised 40 tuorous tissue saples and 22 noral tissue saples. The data were also pre-processed using logarithic transforation to base 10 and zero ean unit variance noralization as often done with the dataset. In addition, coparison with Rando Forest (RF), Gradient Boosting Machine (GBM) and Bayesian Additive Regression Trees (BART) was also achieved. Various perforance results established the supreacy of the proposed ethod. Keyword: Cancer; Linear and Quadratic discriinant; Bayesian; Model Uncertainty; Model averaging. 1. Introduction Recent findings reveal that various cancer types can be diagnosed using non-clinical approach which involves onitoring of the biological saples using their genes expression profiles. However, this advanceent is possible due to the enhanceent of icroarray technology which ade it possible to observe gene expression levels of several gene chips concurrently [1, 2]. Several authors have discussed the health benefits of the non-clinical diagnosis breakthrough, but the ajor proble that still exists is how to adequately identify the few subsets of thousands genes whose inforation can be used to reliably classify the RNA saples into their respective biological groups. In addition, it has been observed that adequacy of any ethod strongly depends on the health probles [3]. Several types of achine learning algoriths have been proposed to perfor the task of non-clinical diagnosis RNA (essenger Ribonucleic acid). RNA saples are usually collected one several biological features (genes). The resulting data structure are of the for n p, where the nuber of patients n is far less than the nuber of biological features. This scenario is often tered as High-diensional data [4]. Highdiensionality poses serious proble in statistical analysis and in-fact when building achine learning algoriths. This is because ost statistical ethods require the nuber of patients (equations) to be ore than the nuber of attributes (paraeters) for a unique solution to exist. Bayesian procedures are the eerging solution to ost applications of statistics in the recent tie. In fact, it has the least error rate in theory [5,6]. LDA and QDA are often regarded as the Bayesian classifier [2, 7] because they are otivated by Bayes theore. Two of the iportant assuptions of LDA and QDA is norality assuption of the feature space (x) and also orthogonality of feature space [4]. The efficiency or accuracy of LDA and QDA classifier strongly relies on these two assuptions. The classifiers becoe unstable when the assuptions are not et. High-diensional data usually violates these assuptions as in the case of the data used in this research. The data do not satisfy the norality assuption as well as the diensionality proble often lead to ulticollinearity. In the light of this, robust *Corresponding author: rid4stat@yahoo.co 2017 UTHM Publisher. All right reserved. penerbit.uth.edu.y/ojs/index.php/jst

2 ethods like Rando Forests [8] stochastic gradient boosting [9], Bayesian additive regression trees [10], Bayesian additive regression trees using Bayesian odel averaging [6]. have been developed. The ethods are efficient but are coputationally expensive than the siple LDA and QDA. Therefore, in this paper, we developed enseble of LDA and QDA using Bayesian odel averaging approach in order to increase the efficiency of LDA and QDA when analysing high-diensional data. 2. Bayes Classifier The foreost Bayesian classification ethods are linear and quadratic discriinant analysis [11]. Specifically, Bayes classifier is defined by f C (x) over a rando vector X and rando vector Y where f C (x) Pr(X = x Y = c) (1) denoting the density function of X for an observation that coes fro the Cth class. In other words, f C (x) is relatively large if there is a high probability that an observation in the Cth class has X x, and f C (x) is sall if it is very unlikely that an observation in the Cth class has X x. Then Bayes theore states that; Pr(X = x Y = c) = π Cf C (x) C l=1 π l f l (x) (2) If we assue x is norally distributed and estiate its associated location and scale paraeter by axiizing the likelihood iplies we are constructing a Linear Discriinant Analysis (LDA) or Quadratic Discriinant Analysis (QDA). The classifier constructed is LDA if we assuing equality of class variance if otherwise its QDA. Forally, the discriinant δ c (x) for a class c is define as; δ c (x) = x T Σ 1 µ C 1 2 μ C T Σ 1 µ C + logπ c (3) where µ C is the location ean for class c, and Σ is covariance atrix of p predictors for class c. (2) is often referred to as LDA since we assue class variance are the sae across predictors. If otherwise we can obtain QDA as: The estiates of the unknown paraeters µ 1,..., µ C, π 1,..., π C, and Σ are usually estiated via Maxiu Likelihood (MLE, [11]) as; μ C = 1 n c n c x i i:y i =C C Σ = n c 1 n C Σ c c=1 The estiates obtained are then plugged into (3) and (4) to deterine the LDA and QDA classifier. 3. Bayesian Model Averaging of LDA and QDA Given a classification rule δ(x) as earlier derived, and its associated prior probability P(δ), then we can define a posterior density P(δ k Y) = L(δ k (x))p(δ k ) k=1 L(δ k (x))p(δ k ) (5) as density of all possible classifiers. Usually, what deterines classifier k is the subset of feature x in the atrix. For each k, there exist r subset of x space. Following the earlier work of Clyde on Bayesian odel averaging in [12]. It was derived that under regularized prior P(δ) can be approxiated with the posterior P(δ k Y) as; P(δ k Y) = exp( 0.5BIC(δ k))p(δ k ) P(δ k Y) = where; k=1 exp( 0.5BIC(δ k ))P(δ k ) exp( 0.5BIC(δ k)) k=1 exp( 0.5BIC(δ k )) L(δ k (x)) = exp( 0.5BIC(δ k )) BIC(δ k ) = 2log[L(δ k (x))] + plog(n) (6) (7) After posterior estiation, we can then estiate paraeter of interest by; P[ Y] = P(δ k Y)P( δ k, Y) k=1 δ c (x) = x T Σ 1 µ C 1 2 μ C T Σ c 1 µ C + logπ c (4) 141

3 Specifically, for LDA and QDA, the posterior class probabilities are: For LDA; P[X = x Y = c] = P(LDA Y)P(X = x LDA, Y) k=1 (8) For QDA; P[X = x Y = c] = P(QDA Y)P(X = x QDA, Y) k=1 (9) Equation (8) and (9) correspond to proposed ethods used in this research. 4. Data Calibration The data eployed for this study were obtained fro icroarray Princeton repository ( data/index.htl.) on colon cancer. The data contained 2000 gene expression profiles easured on 62 biological saples that coprised 40 tuorous tissue saples and 22 noral tissue saples. Details of the icroarray experient that produces the data can be found in the works of Alon works in [13]. The data were also pre-processed using logarithic transforation to base 10 and zero ean unit variance noralization as often done with the dataset. We copared the perforance of the ethods BMA-LDA and BMA-QDA with LDA, QDA, RF, BART and GBM using the class specific and overall perforance etrics. The etrics used include overall accuracy, balance accuracy, sensitivity, specificity, negative predictive value, positive predictive value, false positive and false negative. The etrics are coputed based on the confusion atrix between 10 folds cross validated test saples and the actual class. The confusion atrix is presented in Table 1 as observed in [14]; where TN represents True Negative, FP is the False Positive, FN represents False Negative and TP is the True Positive. Also, N* is the total predicted negative and P* represents total predicted positive. Siilarly, N is the total actual negative while P is the total actual positive. T represents the total nuber of observation equivalent to; T = TN + FP + TP + FN (10) Here negative eans noral cells while positive eans tuour cells. Accuracy (ACC): %ACC = 100 ( TN+TP ) T Sensitivity:%Sensitivity = 100 ( TP P ) Specificity:%Specificity = 100 ( TN N ) Balance Accuracy (BACC): %Sensitivity + %Specificity %BACC = 2 Positive Predictive Value (PPV): %PPV = 100 ( TP P ) Positive Predictive Value (PPV): %NPV = 100 ( TN N ) False Positive Rate (FPR): %FPR = 100 %Specificity False Negative Rate (FNR): %FNR = 100 %Sensitivity Misclassification Error Rate (MER): %MER = 100 %ACC Table 1 Confusion atrix True Class Predicted Class Total 0 TN FP N 1 FN TP P Total N* P* T 0: Noral, 1: Tuour 142

4 5. Results and Discussion Table 2 and Table 3 show the results based on v folds cross validation where v = 10. The overall perforance easure using accuracy revealed that the ost accurate classifier for diagnosing colon cancer is either BMA-LDA or BMA-QDA. To control class ibalance, we coputed the balance accuracy which also reveals that BMA-LDA or BMA-QDA are the ost accurate classifiers. Table 2 Perforance easures in (%) for BMA-LDA, BMA-QDA, LDA and QDA based on average of 10 folds cross validation. Methods Metrics BMA-LDA BMA-QDA LDA QDA sens specs PPV NPV FP FN er acc BACC Table 3 Perforance easures in (%) for RF, BART and GBM based on average of 10 folds cross validation. Methods Metrics RF BART GBM sens specs PPV NPV FP FN er acc BACC The class specific etrics is very iportant when diagnosing cancer. The procedure ust be highly sensitive for it to be able to detect its presence as early as possible. Sensitivity result in Table 1 shows that the ost sensitive classifiers are BMA-LDA and BMA-QDA with about 97% sensitivity. 6. Conclusion In this paper, we have presented Bayesian odel averaging approach of LDA and QDA for non-clinical diagnosis of colon cancer. The results using the real life dataset indicated that BMA-LDA and BMA-QDA are the best in ters of overall diagnostic accuracy as well as class specific accuracy. Acknowledgeent This work was supported by Universiti Tun Hussein Onn, Malaysia (grant nuber Vot, U607). References [1] Yahya W. B., Olaniran O. R. and Ige, S. O. (2014). On Bayesian Conjugate Noral Linear Regression and Ordinary Least Square Regression Methods: A Monte Carlo Study in Ilorin Journal of Science, Vol. 1 No. 1 pp [2] Olaniran, O.R., Olaniran, S. F., Yahya, W. B., Banjoko, A. W., Garba, M. K., Ausa, L. B. and Gatta, N. F.(2016). Iproved Bayesian Feature Selection and Classification Methods Using Bootstrap Prior Techniques in Anale. Seria Inforatică. Vol. 14 No. 2 pp [3] Si, Adelene YL;Minary, Peter; Levitt, Michael (2012). "Modeling nucleic acids" in Current Opinion in Structural Biology Nucleic acids/sequences and topology. Vol. 22 No. 3 pp [4] Hastie, T., Jaes, G., Witten, D., & Tibshirani, R. (2013). An Introduction to statistical learning. Springer, New York. [5] Lesaffre E. and Lawson, A. B. (2013). Bayesian Biostatistics. John Wiley & Sons, Ltd, New Jersey. [6] Hernández, B., Raftery, A. E., Pennington, S. R., & Parnell, A. C. (2015). Bayesian Additive Regression Trees using Bayesian Model Averaging. 143

5 [7] Olaniran, O. R., & Yahya, W. B. (2017). Bayesian Hypothesis Testing of Two Noral Saples using Bootstrap Prior Technique in Journal of Modern Applied Statistical Methods, Vol. 16 No. 2 pp [8] Breian, L. (2001). Rando forests. Machine Learning, Vol. 45, pp [9] Friedan, J. H. (2002). Greedy function approxiation: A gradient boosting achine in Annals Statistics., Vol. 29 No. 5 pp [10] Chipan, H.A. George, E. I. and McCulloch R. E. (2010). BART: Bayesian Additive Regression Trees in Annals Applied Statistics, Vol. 4, pp [11] Hastie T, Tibshirani R, Friedan J (2011). The Eleents of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition. Springer- Verlag, New York. [12] Clyde, M. (1999). Bayesian Model Averaging and Model Search Strategies (with discussion). in Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Sith, 157{185. Oxford: Oxford University Press. [13] Alon, U., Barkai, N., Notteran, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.(1999). Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tuor and Noral Colon Tissues Probed by Oligonucleotide Arrays. in Proceeding of National Acadeia of Science, Vol. 96 pp [14] Banjoko, A. W., Yahya, W. B., Garba, M. K., Olaniran, O. R., Dauda, K. A., & Olorede, K. O. (2015). Efficient Support Vector Machine Classification of Diffuse Large B-Cell Lyphoa and Follicular Lyphoa MRNA Tissue Saples. Annals. Coputer Science Series, Vol. 13 No. 2 pp

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition