CLASSIFICATION OF MEDICAL IMAGES INTO ALZHEIMER'S DISEASE, MILD COGNITIVE IMPAIRMENT AND HEALTHY BRAIN


CLASSIFICATION OF MEDICAL IMAGES INTO ALZHEIMER'S DISEASE, MILD COGNITIVE IMPAIRMENT AND HEALTHY BRAIN

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF RESEARCH IN INFORMATICS IN THE FACULTY OF SCIENCE AND ENGINEERING

2018

By Chao Fang
School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
  1.1 Problem Statement
  1.2 Aim and Objectives
    1.2.1 Aim
    1.2.2 Objectives
  1.3 Organization of the Document

2 Background
  2.1 Alzheimer's Disease
    2.1.1 Causes
    2.1.2 Stages and symptoms
    2.1.3 Brain changes caused by AD
    2.1.4 Diagnosis and treatment
      Diagnosis
      Treatment
  2.2 Machine learning
    2.2.1 Machine learning pipeline
      Supervised learning
      Unsupervised learning
      Reinforcement learning
    2.2.2 Loss function
      Cross-entropy loss
      L2 loss
    2.2.3 Neural network
    2.2.4 Optimization method
      Gradient descent
      Adaptive moment estimation
    2.2.5 Activation function
      Sigmoid
      Rectified linear unit
      Softmax function
    2.2.6 Regularization
      L2 regularization
      Dropout
    2.2.7 Convolutional neural network
      Convolution layer
      Pooling layer
    2.2.8 Data augmentation
    2.2.9 Model selection
      Cross-validation
      Receiver operator characteristics
  2.3 Current approaches to detect AD using machine learning methods

3 Experimental method
  3.1 Research methods
  3.2 Overall design of the system
  3.3 Software & Environment
    Hardware
    Software
  3.4 Data
    Dataset
    Montreal Neurological Institute average brain of 152 scans (MNI152)
    Harvard-Oxford cortical and subcortical structural atlases
  3.5 Method
    3.5.1 Brain extraction
    3.5.2 Linear registration
    3.5.3 Segmentation
    3.5.4 ROI extraction
      Production of ROI mask
      ROI extraction
    3.5.5 Classification
      Activation function
      Loss function
      Pooling
      Learning algorithm
    3.5.6 Data augmentation
    3.5.7 Model selection

4 Results and discussion
  4.1 The use of GPU
  4.2 Comparison of different models on accuracy
  4.3 Effect of data augmentation
  4.4 Comparison of results between ours and others

5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work

Bibliography

A Full experiment results
  A.1 Full results

B Source code
  B.1 Pre-processing
  B.2 Classification
    B.2.1 Label generation
    B.2.2 Classification

List of Tables

2.1 Main stages of Alzheimer's Disease and related symptoms
2.2 Metrics derived from the confusion matrix
3.1 Primary hardware configurations
3.2 List of third-party Python packages used in this project
3.3 Demographic data for the subjects from the ADNI database
3.4 Distribution of the instances
3.5 Hyper-parameters and values for grid search
4.1 Architecture and hyper-parameters in the experiment for GPU and CPU comparison
4.2 The best 5 and worst 5 results for three-way classification
4.3 The best 3 and worst 3 results for two-way classification
4.4 Effect of the data augmentation technique
4.5 The results of our method and others
A.1 Full results for three-way classification
A.2 Full results for two-way classification

List of Figures

2.1 Comparison of cognitive decline between patients and the normal
2.2 Comparison between normal brain structure and the structure affected by the AD
2.3 Machine learning pipeline for supervised learning
2.4 Decision boundaries
2.5 Difference between classification and regression
2.6 Graph of log loss function
2.7 Demonstration of the perceptron
2.8 Feedforward neural network
2.9 A visual example of gradient descent
2.10 Plot of Sigmoid function
2.11 Graph of ReLU function
2.12 Demonstration of Dropout method
2.13 Difference between standard NN and CNN
2.14 Computation processes of the convolutional neural network
2.15 Demonstrations of max pooling and average pooling
2.16 Examples of data augmentation techniques
2.17 Examples of data augmentation methods
2.18 Random splitting method
2.19 Data splitting for cross-validation
2.20 Procedure of full cross-validation
2.21 Confusion matrix for the classification problem with two classes
2.22 A basic ROC graph
3.1 Research methods in CS
3.2 Architecture of the HOKUSAI system
3.3 ICBM 152 template without skull
3.4 Harvard-Oxford subcortical and cortical structural atlases
3.5 MRI scan before and after brain extraction
3.6 Demonstration of affine registration
3.7 Examples of the segmentation process
3.8 Creation of an image for single substructure
3.9 Creation of ROI mask
3.10 Demonstration of ROI extraction
3.11 Overall structure of the network
4.1 Time used when using GPU(s) or CPU
4.2 Error landscapes of training and testing for three-way classification
4.3 Error landscapes of training and testing for two-way classification
4.4 ROC graph for two-way classifiers

Abstract

Alzheimer's disease is the most common form of dementia. It not only reduces patients' quality of life, but also places a financial and emotional burden on their families and carers. Early diagnosis is key, since treatments have been found to slow the progression of the disease. In this research, we design and implement a novel automatic computer-aided diagnostic system for the early diagnosis of this disease based on magnetic resonance imaging (MRI) scans. In our work, several substructures of the brain related to Alzheimer's disease are first extracted from the scans; a convolutional network is then trained to extract high-level features and classify the data into different groups. With a suitable network architecture, the system achieves accuracies of 91.23% for two-way classification (normal control and Alzheimer's disease) and 72.60% for three-way classification (normal control, mild cognitive impairment and Alzheimer's disease). Our system outperforms most similar systems based on traditional machine learning methods. In addition, our experiments show that the use of graphics cards can reduce the training time, although the elapsed time is not inversely proportional to the number of graphics cards used in training. Moreover, a data augmentation technique, mirror flipping, is evaluated; no improvement in the accuracy of our model is observed when using it. This research can also be applied to a wide range of applications that use magnetic resonance imaging scans as input, such as diagnostic systems for brain tumours or pancreatic cancer. We hope that this research provides researchers in this field with a starting point and useful experience.

Keywords: Alzheimer's disease, machine learning, convolutional neural network, deep learning, computer vision.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the "Copyright") and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the "Intellectual Property") and any reproductions of copyright works in the thesis, for example graphs and tables ("Reproductions"), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy, in any relevant Thesis restriction declarations deposited in the University Library, in the University Library's regulations and in the University's policy on presentation of Theses.

Acknowledgements

I would like to thank my family. It is their support that has helped me complete this exciting and challenging adventure. I would also like to thank Dr Fumie Costen for her patience and support. She advised me on the direction of this project and provided me with the best possible resources.

Chapter 1

Introduction

1.1 Problem Statement

Alzheimer's Disease (AD) is a progressive and irreversible neurodegenerative disease and the most common type of dementia [1]. It usually affects the elderly and results in memory loss, cognitive impairment and difficulties with language and self-care [2]. Moreover, according to [3], no current treatment is able to cure AD or stop its progression. This means that once people are diagnosed with Alzheimer's, their brains have already been damaged, and the symptoms will worsen over time. Hence, this disease not only reduces patients' quality of life but also places a heavy financial and emotional burden on their caregivers and families. There are two main stages, called Mild Cognitive Impairment (MCI) and AD respectively, during the course of this disease. MCI is considered the early stage of Alzheimer's, with mild symptoms or no symptoms at all. As the disease develops, however, patients in the early stage will progress to Alzheimer's over time. The cause of AD is not fully understood, but one possible cause is the effect of two proteins, amyloid beta and tau, which result in the loss of connections between neurons and eventually lead to the death of brain tissue [3]. According to researchers from the University of Zurich [4], there is a way to limit the production of amyloid beta and thereby slow down the development of AD. Therefore, it is critical to diagnose AD in the early stage and apply treatment to slow down the progression before it causes serious damage to the brain and various symptoms. Such treatment could improve patients' quality of life and alleviate their symptoms, as well as ease the burden on their families. The current diagnosis of AD involves several tests, such as medical history analysis, the Mini-Mental State Examination (MMSE) and medical tests. Usually, this process is

mainly based on the experience of specialists [1]. According to [5], it is difficult for a doctor to diagnose this disease in the early stage, since many other diseases have symptoms similar to those of AD. As a result, patients with AD may miss the best time to receive treatment. Moreover, the accuracy of a diagnosis of this disease made using a machine learning method can be better than that of a diagnosis made by a doctor [6]. It is therefore both rational and crucial to build a computer-aided diagnostic system for AD based on machine learning techniques; such a system can help doctors make an accurate diagnosis of this disease in the early stage.

1.2 Aim and Objectives

1.2.1 Aim

The aim of this project is to develop an automatic computer-aided diagnostic (CAD) system for the accurate diagnosis of Alzheimer's Disease by using machine learning methods to classify Magnetic Resonance Imaging (MRI) scans.

1.2.2 Objectives

- Apply data pre-processing tools to complete the skull stripping, registration and segmentation tasks and obtain the grey matter, which is affected by AD.
- Use morphometric techniques to process the scans and obtain substructures/regions of interest (ROIs) of the brain from the grey matter.
- Based on the ROIs, build and evaluate models (convolutional networks) with various architectures to obtain an accurate classification result.
- Test and evaluate a data augmentation technique which could be used to improve the performance of the model.

1.3 Organization of the Document

This document consists of 5 main components, organized as follows:

Chapter 2 gives the background and theory for this project. Section 2.1 explains Alzheimer's Disease in detail, including its causes, stages, symptoms and diagnosis. After that, the background, techniques and theories related to machine learning, especially neural networks and convolutional neural networks, are presented in Section 2.2. Finally, existing computer-aided diagnosis systems for Alzheimer's are described in detail and their limitations are identified.

Chapter 3 first presents the research method used in this project. Motivated by the limitations of the existing systems, the idea and overall design of our system are explained in Section 3.2. Section 3.3 describes the development and test environments, including the development language, packages and hardware. Section 3.4 explains the data used in this project. The last section of this chapter presents the details of the system, such as the pre-processing procedure, the architecture of the neural network and the choice of hyper-parameters.

In Chapter 4, the results of the experiments are collected and illustrated. A variety of comparisons are also performed, such as the use of a Graphics Processing Unit (GPU) versus a Central Processing Unit (CPU), and the effect of data augmentation techniques. In this thesis, the terms GPU and graphics card are used interchangeably, since there is usually one GPU on a graphics card. We also compare the performance of our work with that of other similar works. Furthermore, the results are discussed and analysed in detail.

Chapter 5 concludes the project and the experimental findings. Based on the conclusions, possible future investigations are also discussed.

Chapter 2

Background

2.1 Alzheimer's Disease

Alzheimer's Disease (AD) is a progressive neurodegenerative disease first described and reported by Alois Alzheimer in 1907 [7]. The disease is characterized by problems with memory and thinking and eventually leads to the death of the patient. Alzheimer's disease is the most common cause of dementia, comprising 60% to 80% of the 46 million people with dementia around the world [8]. Dementia is a term used to describe a series of symptoms related to memory loss and a decline in thinking skills. According to the prediction by [8], 1 in 85 people around the world will be affected by dementia. Beyond its effects on the human body, dementia and Alzheimer's also impose a heavy economic burden. The same report [8] points out that the cost of dementia in 2016 was around 818 billion U.S. Dollars, and that the cost will increase to over 1,000 billion this year. By 2030, the cost will double, reaching 2,000 billion U.S. Dollars. In low-to-medium-income countries, 94% of those with dementia live and receive nursing care at home due to the lack of support from the local medical system [8]. Therefore, this disease also puts a heavy burden on the patients' families.

2.1.1 Causes

Currently, the cause of Alzheimer's Disease is still unknown, and there are several hypotheses. The genetic factor is one of them. According to [1], around 1% of AD patients have this disease due to genetic inheritance; this kind of case

is called early-onset familial Alzheimer's disease, since these people usually develop AD before age 65. Another genetic risk factor is the ɛ4 allele of apolipoprotein E (APOE), which increases the risk of developing AD by 3-15 times [7]. In 1991, research suggested that the build-up of amyloid beta in the brain is the root cause of AD [9]. This hypothesis holds that the connections between nerve cells in the brain are pruned by amyloid-related protein, and that this process eventually results in the death of the neurons. In the tau hypothesis, tau protein is deemed the root cause of the disease: neurofibrillary tangles form due to the dysfunction of tau protein inside the nerve cells, the microtubules are then destroyed, and the connections between the nerve cells are lost, which leads to the death of these cells. Though tangles and plaques exist in the cells of most people, they are found in far greater numbers, and in a specific pattern, in people with Alzheimer's [8]. Other hypotheses involve lifestyle, brain injuries, inflammation and so on. Although none of these hypotheses can explain the disease perfectly, the death of vast numbers of nerve cells and the atrophy of brain tissue are confirmed consequences of AD.

2.1.2 Stages and symptoms

Table 2.1: Main stages of Alzheimer's Disease and related symptoms [1].
Pre-dementia: forgetting things; difficulties in abstract thinking; apathy.
Mild Alzheimer's disease: difficulties in speech and in picking the right words and names; short-term memory loss; problems in the execution of movements.
Moderate Alzheimer's disease: evident speech difficulties; confusion about words and the loss of reading and writing skills; long-term memory loss; moodiness; personality and behavioural changes.
Advanced Alzheimer's disease: difficulties in daily activities and personal care; difficulty in communication; movement disorders; vulnerability to infections; loss of awareness of recent experiences.

Figure 2.1: Comparison of cognitive decline between patients and the normal [10].

According to the Alzheimer's Society [3], although some common symptoms have been identified, symptoms vary considerably between patients with AD. Four stages of the AD course are defined by [1], based on cognitive ability and functional impairment. The symptoms in each stage are summarized in Table 2.1. Pre-dementia, usually termed mild cognitive impairment, is considered the early stage of Alzheimer's disease. At this stage, the disease has no obvious effect on daily life, and the mild symptoms are often mistakenly attributed to ageing. For patients in this stage, irreversible damage to the brain has already been occurring for a decade or more, although symptoms such as cognitive and memory problems are not yet noticeable. Some people with MCI will progress to AD or another type of dementia. The second stage is called Mild Alzheimer's disease. Those classified into this stage and the following two stages are considered AD patients. Patients in this stage can manage daily life without assistance, although they may notice some changes such as moderate memory loss. Next is the Moderate stage. People with AD in this stage need care from others and may not be able to express thoughts clearly, due to the loss of nerve cells. In the final stage of the disease, the symptoms worsen further. Individuals lose the ability to care for themselves and are hardly able to communicate with others.

2.1.3 Brain changes caused by AD

The human brain contains over 100 billion neurons, each connected to many others [1]. The brain changes with age, and cognitive function declines due to the loss of neural cells. That is why some of the elderly have some kinds of

thinking and memory problems. However, a large decline in memory and cognitive function can be a sign of the abnormal loss of neurons caused by dementia, including Alzheimer's disease. Figure 2.1 shows the difference in cognitive decline between normal ageing people and people with AD: AD patients show a much steeper decline in cognitive function than the normal elderly. According to [5], there is normally a small percentage reduction in the volume of the hippocampus, which is responsible for forming memories, in people over the age of 60; for people with AD, however, the shrinkage can be between 2.2 and 5.9 per cent. The loss of neurons in the brain first takes place in the hippocampus, a decade or more before the appearance of noticeable symptoms. Then, as more neural cells die, the damage spreads to additional parts of the brain and eventually leads to the atrophy of the hippocampus. At the final stage of the AD course, the whole brain is affected, and the widespread damage results in the shrinkage of the cerebral cortex as well as enlarged ventricles, as shown in Figure 2.2. As we can see, the hippocampus of an AD patient is much smaller and the cerebral cortex thinner than those of a normal person, whereas the ventricles of an AD patient are significantly larger.

2.1.4 Diagnosis and treatment

Diagnosis

According to [1], the diagnosis of AD is currently based on a series of tests, since no single test provides enough evidence for a diagnosis. Although these tests help doctors make a decision, a definite diagnosis can only be made by post-mortem examination of brain tissue. Usually, the doctor first asks the person being assessed about his/her medical history, symptoms and family factors related to AD, in order to rule out other diseases with the same symptoms as AD.
Then, some basic physical examinations and diagnostic tests are carried out to identify health issues that could cause the symptoms of dementia. To reach a final decision, the person usually undergoes a brain scan. The two most widely used imaging techniques are Magnetic Resonance Imaging (MRI) and Computerized Tomography (CT). These scans can show the changes in the brain; they help a doctor rule out other conditions such as a tumour or stroke, and allow direct observation of the changes in a brain affected by AD, such as the atrophied hippocampus and surrounding brain tissue. This means that the spatial structure and volume of the substructures of the brain can be used as

Figure 2.2: Comparison between normal brain structure (upper) and structure affected by the AD (lower).

biomarkers to detect AD. This principle is also the theoretical basis of most CAD systems for the diagnosis of AD; these existing works will be discussed in Section 2.3. The MMSE is another assessment that can help the doctor evaluate the person [3]. During the MMSE test, a series of specially designed questions is asked by a professional to test a person's everyday mental skills. The MMSE score ranges from 0 to 30, and a score lower than 24 suggests that the person has dementia. For patients with AD, a decrease of two to four points on the MMSE score can be observed each year, on average [1].

Treatment

On average, a person with Alzheimer's lives 8 years after the occurrence of noticeable symptoms [3]. However, depending on health and care conditions, the lifespan varies from 4 to 20 years. Currently, there is no cure for Alzheimer's disease that eliminates the symptoms or stops its progression, and the damage caused by AD is irreversible. However, it is possible to relieve the symptoms and temporarily slow down the development of AD with drug and non-drug treatments. Several drugs, such as Aricept, Reminyl and Ebixa, are often used to help with memory problems and daily living [1]. Products or methods such as electronic reminders and a weekly pill box are also effective ways to help a patient live independently without a caregiver and cope with memory loss [1].

2.2 Machine learning

Machine learning is a multidisciplinary subject related to a wide range of fields, including probability, statistics, optimization, algorithms and artificial intelligence [11]. The application of machine learning in the database domain for the analysis of data is called data mining. It is used to process massive amounts of data to build a simple but powerful model, also known as a classifier in a classification task, for a diversity of tasks such as credit analysis, fault detection and control.
Machine learning also plays an important role in artificial intelligence. To improve the adaptivity of a robot or machine [12], the system needs the ability to learn from a changing environment instead of following decisions exhaustively enumerated by its designers. Moreover, in the computer vision field, machine learning methods such as the Fast Region-based Convolutional Network (Fast R-CNN) have outperformed traditional methods on object detection and facial recognition tasks [13].
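The idea of building a classifier from data can be made concrete with a small sketch. The nearest-centroid classifier and toy data below are our own illustration, not code from this thesis: the labelled points play the role of the data, `fit` is the learning algorithm, and the returned centroids form the model used for prediction.

```python
import numpy as np

# Toy data: two clusters of 2-D points with labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(3.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

def fit(X, y):
    """Learning algorithm: estimate one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    """Model: assign each input to the nearest class centroid."""
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

model = fit(X, y)                              # training step
accuracy = (predict(model, X) == y).mean()     # evaluation on the training data
```

Because the two clusters are well separated, this simple learner classifies the toy data almost perfectly; real MRI data is far less tidy, which motivates the more powerful models discussed later in this chapter.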

Figure 2.3: Machine learning pipeline for supervised learning [14].

2.2.1 Machine learning pipeline

A machine learning pipeline, shown in Figure 2.3, usually includes three main components: data, a learning algorithm and a model [14]. The data is collected beforehand and may contain an interesting pattern. According to the type of training data, machine learning algorithms are divided into two main groups, supervised learning and unsupervised learning, which are discussed in the following subsections. The learning algorithm is the major research area of machine learning [11]; it is used to build a model from the data. The algorithms used in machine learning differ from algorithms designed to directly solve a target problem such as sorting or shortest-path problems. Machine learning algorithms are designed to find patterns or regularities and build models from the data [12]. That means a machine learning algorithm is expected to automatically extract and generate a solution from the training data for a specific task, such as spam detection or decision making, for which no existing algorithm can directly solve the problem. The result of a machine learning algorithm is usually a mathematical model with parameters. This model can be expressed in many forms, such as linear equations, trees and

Figure 2.4: Decision boundaries drawn by the linear model of equation (2.1). Left: in 2-dimensional (2D) space, data points are 2D and the boundary is a line (1D). Right: in 3-dimensional (3D) space, data points are 3D and the boundary is a 2D plane [14].

graphs. A simple example of such models is the linear model [14], defined as

$\mathbf{w}^{T}\mathbf{x} - b = \sum_{j=1}^{d} w_j x_j - b$   (2.1)

where $\mathbf{w}$ and $\mathbf{x}$ are d-dimensional vectors representing the weight vector and the input vector respectively, $w_j$ and $x_j$ are the jth elements of the weight vector and input vector, and b is a bias. In this case, the purpose of the learning algorithm, in the classification scenario, is to find an appropriate parameter vector $\mathbf{w}$ that yields a plane, called the decision boundary, separating data points according to their classes. Figure 2.4 shows decision boundaries in 2-dimensional and 3-dimensional space.

Supervised learning

Supervised learning is a kind of machine learning in which each training example consists of an input and a desired label, also called a category [14]. The task of supervised learning is to infer a function that maps the input data to the desired label. As the label for each training instance is available, it is possible to adjust the parameters of the inferred function to better fit the input data according to the difference between the predicted and desired labels; this is why these methods are called supervised learning. The most common applications of supervised learning are regression and classification, illustrated in Figure 2.5. In machine learning, a classification task involves learning a mapping function to predict the category of a given

Figure 2.5: Difference between classification and regression [14].

instance; usually, the categories are discrete. Support Vector Machines (SVM) [15] and decision trees [16] are among the widely used classification algorithms. The output variables of regression tasks, on the contrary, are usually continuous, and the aim of a regression task is to construct a function that predicts a quantity from the input. Learning algorithms for regression include linear regression and regression trees [17].

Unsupervised learning

When the label for each example is not available, the machine learning method used is called unsupervised learning. The target is to find patterns or regularities in the input data without the guidance provided by labels. According to [11], the most common application of unsupervised learning is clustering, whose aim is to separate the samples of a dataset into several disjoint subsets, called clusters. Instances in the same cluster are expected to be similar to each other and different from those in other clusters. Such algorithms include K-means and the Gaussian Mixture Model (GMM) [18].

Reinforcement learning

In some applications, the output of a machine learning method is a series of actions, regardless of the availability of labels. In this case, we only pay attention to the result, or policy, comprising the series of actions that achieves the goal, not to any single action; there is no optimal single action. The machine learning method should be able to evaluate the existing policies and rewards, and learn from past

series of actions in order to generate an optimal policy. Game playing and decision making are examples of applications of reinforcement learning. Algorithms for reinforcement learning include Q-learning [19], State-action-reward-state-action (SARSA) [20] and deep Q-learning [21].

2.2.2 Loss function

To help a machine learning algorithm find the correct pattern in the data, a precise objective, called a loss function, is set for the algorithm to pursue [14]. A loss function measures the mistakes made by a model, and the ultimate target of a learning algorithm is to minimize these mistakes. The sum of the loss function over all data points is called the error function, which represents the overall mistakes the model makes on the whole dataset. Two of the most commonly used loss functions are the cross-entropy loss and the L2 loss.

Cross-entropy loss

The cross-entropy loss function [22], also known as log loss [14], is defined as

$L = -\sum_{c=1}^{M} y_c \log\left(f(\mathbf{x})_c\right)$   (2.2)

where M is the number of classes, $f(\mathbf{x})_c$ is the output of the model representing the probability of an instance belonging to class c, and $y_c$, the real label, equals one if the instance belongs to class c and zero otherwise. The cross-entropy loss measures the similarity of the real distribution and the predicted distribution of the dataset. When the real distribution of the dataset equals the predicted distribution, the cross-entropy loss function achieves its minimal value. Therefore, the process of minimizing the cross-entropy loss is also the process of forcing the predicted distribution to match the real distribution of the dataset. When there are only two classes, the function, known as the binary cross-entropy loss, is a special case of cross-entropy and can be defined as [14]

$L = -\left(y \log f(\mathbf{x}) + (1 - y) \log\left(1 - f(\mathbf{x})\right)\right)$   (2.3)

where y, the real label, is either 1 or 0 and $f(\mathbf{x})$ is the output of the model.
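Equations (2.2) and (2.3) translate directly into a few lines of NumPy. The example labels and predicted probabilities below are made up purely for illustration; they are not data from this thesis.

```python
import numpy as np

def cross_entropy(y_onehot, probs):
    """Multi-class cross-entropy of equation (2.2): L = -sum_c y_c log f(x)_c.
    y_onehot is the one-hot true label, probs the predicted class probabilities."""
    return -np.sum(y_onehot * np.log(probs))

def binary_cross_entropy(y, p):
    """Binary special case of equation (2.3), with true label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A three-class example: the true class is class 1 (one-hot encoded).
y = np.array([0.0, 1.0, 0.0])
confident_right = cross_entropy(y, np.array([0.05, 0.90, 0.05]))  # small loss
confident_wrong = cross_entropy(y, np.array([0.90, 0.05, 0.05]))  # large loss

# The binary case behaves the same way: the loss is -log(0.9) vs -log(0.1).
low = binary_cross_entropy(1, 0.9)
high = binary_cross_entropy(1, 0.1)
```

A confidently correct prediction yields a loss near zero, while a confidently wrong one is penalized heavily, which is exactly the pressure the learning algorithm needs to push the predicted distribution toward the real one.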
Figure 2.6 shows the graph of the binary cross-entropy loss function. As we can see, when the true label is one, the log loss approaches zero as the output of the model approaches one. Hence, under the log loss, the learning algorithm is encouraged

Figure 2.6: Graph of the log loss function.

to adjust the parameters of the model to produce an output toward the true label. The sum of the cross-entropy loss function over all examples is called the cross-entropy error, or the negative log-likelihood error.

L2 loss

Another widely used loss function is the L2 loss, whose error function is known as the Mean Square Error (MSE). The loss function and error function are defined as [23]

$L = \left(f(\mathbf{x}) - y\right)^2$   (2.4)

$C = \frac{1}{2m} \sum_{i=1}^{m} \left(f(\mathbf{x}_i) - y_i\right)^2.$   (2.5)

In equation (2.4), $f(\mathbf{x})$ and y are the predicted label and the true label respectively. Equation (2.5) is the error function, where m is the total number of examples in the dataset and $f(\mathbf{x}_i)$ and $y_i$ are the output and label of the ith example respectively. The sum of the squared differences between the predicted labels and the corresponding true labels over all data points composes the mean square error of the model. The mean square error has a natural geometric meaning, representing the Euclidean distance between

Figure 2.7: Demonstration of the perceptron.

the real value and the prediction [11]. An algorithm that uses mean square error as the optimization target is called the least square method [11]. In linear regression, the least square method tries to find a hyper-plane that minimizes the sum of squared Euclidean distances between all data points and the hyper-plane.

2.2.3 Neural network

An Artificial Neural Network (ANN) is a kind of machine learning method. ANNs are inspired by biological nervous systems, such as the brain with its large number of interconnected neural cells, and by the way these systems process information [23]. The first artificial neural network was introduced decades ago; however, the lack of suitable devices and techniques limited its development. With the progress of computational power and related techniques, such as the use of GPUs, neural networks have started showing their power and have outperformed traditional machine learning methods in many areas, such as decision making [21] and computer vision [24]. A neural network is a complex network with a large number of interconnected neurons and a series of non-linear activation functions, discussed later in this section. This structure enables the network to approximate almost any given mapping or function from input to output [23]. Therefore, a neural network has a remarkable ability to recognize and detect patterns in complex data. As in the biological nervous system, the basic unit in

the artificial neural network is called a neuron, but in an ANN, neurons are grouped into layers. A simple neural network named the perceptron, which contains only one neuron, is shown in Figure 2.7 [25]. This network, containing n weighted inputs, an activation function and one output, is one of the simplest networks. Nevertheless, this structure enables it to learn any linearly separable function and to represent all primitive Boolean functions except XOR. In a perceptron, the inputs and output of the neuron are usually real numbers, and the neuron computes the weighted sum of its inputs. This weighted sum is then passed to an activation function, called the sign function, to produce the output of the network [25], which can be expressed as

f(x) = \begin{cases} 1, & \text{if } w^T x = \sum_{i=0}^{n} w_i x_i > 0 \\ -1, & \text{otherwise} \end{cases}    (2.6)

In equation (2.6) and Figure 2.7, w_i and x_i are the ith elements of the weight vector w and the input vector x respectively. x_0 is a constant value 1 and w_0 is the bias. Although the perceptron is one of the simplest feedforward networks, it plays an important role as the basic unit of the neural network shown in Figure 2.8. That network is an example of a feedforward neural network, in which the output of one layer is fed into the next layer and no cycle exists. This kind of fully connected neural network is also known as a multi-layer perceptron. Neural networks can also contain more complex connection patterns or loops, as in the Convolutional Neural Network and the Recurrent Neural Network [26]. The input layer is where the input vector is fed to the network, so the number of neurons in the input layer equals the dimension of the input data. Hidden layers are all the layers in the neural network except the input and output layers. The output layer is the last layer of the neural network, responsible for outputting the result of the network. The result is used for the further calculation of an error function.
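The perceptron of equation (2.6) can be sketched in a few lines of Python (the AND example and its weight values are my own illustration, not from the text):

```python
def perceptron_output(w, x):
    """Perceptron with a sign activation (equation 2.6).

    w : weights, where w[0] is the bias
    x : inputs, where x[0] is the constant 1
    """
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > 0 else -1

# A perceptron computing logical AND on inputs in {0, 1}
# (bias -1.5 and unit weights are illustrative values):
and_weights = [-1.5, 1.0, 1.0]
assert perceptron_output(and_weights, [1, 1, 1]) == 1
assert perceptron_output(and_weights, [1, 1, 0]) == -1
```

AND is linearly separable, so a single perceptron can represent it; XOR, by contrast, has no such separating hyper-plane, which is why it is the one primitive Boolean function the perceptron cannot learn.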
In an artificial neural network, activation functions are normally applied only to the neurons in the hidden layers and the output layer.

2.2.4 Optimization method

Gradient descent

As discussed in Section 2.2.3, an artificial neural network contains many weights and biases, and the values of these parameters are unknown and cannot be decided manually. In this case, a learning algorithm called Gradient Descent (GD) is applied to find the

Figure 2.8: Feedforward neural network [14].

Figure 2.9: A visual example of gradient descent [14].

optimal weights and biases to reduce the error, so that the network can approximate the mapping from the input to the output [23]. Figure 2.9 gives an example of how gradient descent works. The basic idea of GD is to update the weights in the direction of the negative gradient of the cost function, which keeps the error decreasing [14]. Take Mean Square Error (MSE), introduced in Section 2.2.2, as an example; it is one of the most commonly used cost functions in neural networks. The MSE cost function is defined in equation (2.5). The cost function is always non-negative because of the square operation. \Delta C is defined as the change to the cost function, and according to [23] it can be approximated as

\Delta C \approx \frac{\partial C}{\partial w_1} \Delta w_1 + \frac{\partial C}{\partial w_2} \Delta w_2 + \dots + \frac{\partial C}{\partial w_n} \Delta w_n = \nabla C \cdot \Delta w    (2.7)

In (2.7), C is the cost function, \Delta w_i is the change made to the ith weight in the network, \partial C / \partial w_i is the partial derivative of the cost C with respect to the ith weight, \nabla C is the gradient vector (\partial C / \partial w_1, \partial C / \partial w_2, \dots, \partial C / \partial w_n)^T and \Delta w is the vector (\Delta w_1, \Delta w_2, \dots, \Delta w_n)^T. In order to minimize the cost function, a method is needed to decrease the cost by keeping \Delta C negative. After a number of iterations, it is then possible to reach the minimal cost value [23]. The problem becomes finding a method to select \Delta w_1, \Delta w_2, \dots, \Delta w_n that keeps \Delta C negative. If we set \Delta w to be -\alpha \nabla C, where \alpha, called the learning rate, is a small positive value between 0 and 1, equation (2.7) becomes [23]

\Delta C \approx -\alpha \nabla C \cdot \nabla C = -\alpha \|\nabla C\|^2.    (2.8)

In this case, the change of C (\Delta C) is always negative, since \|\nabla C\|^2 is positive. Equation (2.8) makes sure the cost function C decreases or stays stable throughout training [23]. The update rule for the ith weight can be expressed as [23]

w_i^{t+1} = w_i^t - \alpha \frac{\partial C}{\partial w_i},    (2.9)

where w_i is the ith weight and t is the number of the weight-update iteration. The above derivation is the principle of gradient descent.
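The update rule of equation (2.9) can be sketched on a one-dimensional cost function (the quadratic cost, its gradient, and the hyper-parameter values are illustrative choices of mine, not from the text):

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly apply the update rule w <- w - alpha * grad(w)
    of equation (2.9)."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Minimizing C(w) = (w - 3)^2, whose gradient is 2 * (w - 3);
# the minimum lies at w = 3, and the iterates converge toward it.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

Each step moves the weight a small distance against the gradient, so the cost can only decrease (or stay flat), exactly as equation (2.8) predicts.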
Gradient descent is a robust and efficient algorithm widely used in training neural networks. However, one limitation makes the algorithm computationally expensive. As equation (2.5) shows, in order to calculate the cost, the outputs resulting from every instance have

to be summed up before the weights can be updated. If the dataset is large, it takes a long time to make a single small step. To address this limitation, improvements have been made to gradient descent to create algorithms that are less computationally intensive than the original one. The two most successful algorithms based on gradient descent are mini-batch gradient descent and stochastic gradient descent [26]. The primary improvement in these two algorithms is to use a small batch of the dataset, instead of all of it, to calculate the cost and partial gradients for the weight update in each iteration. Therefore, the computation in each iteration is much smaller, and the weights are updated more frequently than in the original algorithm. The batch size defines the number of instances used for training in each iteration. When all the training examples have been used to train the neural network once, this is called one epoch. The difference between mini-batch gradient descent and stochastic gradient descent lies in the size of the batch: the batch size for mini-batch gradient descent is a hyper-parameter that needs to be set prior to training, whereas stochastic gradient descent uses only one sample in each iteration.

Adaptive moment estimation

As discussed in the previous section, another hyper-parameter, called the learning rate, needs to be set before training. This hyper-parameter defines the step size for updating the weights, and it remains the same throughout the training process [14]. A big learning rate results in big changes to the weights, but it may change them so much that the algorithm overshoots the minimum or even makes the training process diverge. Conversely, a small learning rate can make the algorithm take a long time to converge, since it makes only tiny changes to the weights. So, the choice of learning rate is one of the most important decisions for training a neural network successfully.
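The effect of the learning rate can be made concrete on the simple cost C(w) = w^2 (the cost function and the three rate values below are illustrative choices of mine):

```python
def run_gd(alpha, w0=1.0, steps=50):
    """Gradient descent on C(w) = w^2, whose gradient is 2w, with a
    fixed learning rate alpha; each step multiplies w by (1 - 2*alpha)."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

small = run_gd(alpha=0.01)   # converges slowly: |w| shrinks by only 2% per step
good = run_gd(alpha=0.1)     # converges quickly toward the minimum at w = 0
big = run_gd(alpha=1.1)      # overshoots: |w| grows every step, so GD diverges
```

With alpha = 1.1 the multiplier is (1 - 2.2) = -1.2, so each update jumps past the minimum to a point farther away than before, which is exactly the divergence described above.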
However, the selection of this hyper-parameter is difficult and largely empirical. In order to address this issue, Kingma and Ba proposed a new learning algorithm [27], called Adaptive Moment Estimation (Adam), which adopts an adaptive learning rate: the effective step size adapts during the training procedure. The main update rules [27] are defined as

g_t = \nabla_w f_t(w_{t-1})    (2.10)

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t    (2.11)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2    (2.12)

\hat{m}_t = m_t / (1 - \beta_1^t)    (2.13)

\hat{v}_t = v_t / (1 - \beta_2^t)    (2.14)

w_t = w_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)    (2.15)

where t is the timestep, f_t(w_{t-1}) is the cost function at timestep t with the parameters w from timestep t-1, \alpha is the learning rate, g_t is the gradient with respect to the parameters w, m_t and v_t are exponential moving averages of the gradient and the squared gradient at timestep t, and \beta_1 and \beta_2 are hyper-parameters, usually set to 0.9 and 0.999, used to control the relative effect of the previous gradients and the current gradient. As m_0 and v_0 are initialized to 0, according to equations (2.11) and (2.12), m_t and v_t are biased toward zero because of these initial values, especially in the first several timesteps. To address this issue, the bias-corrected estimates \hat{m}_t and \hat{v}_t are calculated. The basic idea of Adam is to use both the previous gradients and the current gradient with respect to a parameter when updating it. As a result, each parameter can be considered to have its own learning rate based on its own history. Equation (2.10) computes the gradient of the cost function with respect to w. Equations (2.11) and (2.12) show how the exponential moving averages of the gradient and the squared gradient are updated. Equations (2.13) and (2.14) give the rules to correct the bias in these exponential moving averages, which is caused by their zero initialization. In equation (2.15), the update rule for the weights is defined based on the variables calculated in the previous steps. Adam is efficient and easy to implement. Moreover, in some studies, Adam has shown better performance than other adaptive algorithms such as AdaGrad and SGD with Nesterov momentum [27].

2.2.5 Activation function

In Section 2.2.3, we have seen an activation function, called the sign function, used in the perceptron.
An activation function in a neural network processes the linear combination of the inputs and produces the output of the neuron [23]. Some of the activation

Figure 2.10: Plot of the Sigmoid function [14].

functions, like sigmoid and softmax, take inputs ranging between negative infinity and positive infinity and produce values in a small range like [0, 1]. It is reasonable to consider that the activation function produces the probability of a particular neuron being activated, and that it works like a decision-making function used to decide whether a particular neuron should be active. Moreover, the use of such functions introduces non-linearity into the neural network. A neural network with two or more layers and non-linear activation functions is able to approximate any linear or non-linear function [23]. Otherwise, the approximation produced by a neural network without any activation function will be less effective when the desired function is non-linear.

Sigmoid

One of the most commonly used activation functions is the sigmoid function, also known as the logistic curve [14], which can be expressed as

\varphi(x) = \frac{1}{1 + e^{-x}}.    (2.16)

Figure 2.10 shows the graph of this function. It takes an input x with a domain of all real numbers and produces an output ranging between 0 and 1. Moreover, it is a non-linear function, and its derivative is non-negative at any given point. These features enable this function to be used as an activation function in a neural network. The output of a perceptron [14] with the sigmoid function can be expressed

as

f(x) = \frac{1}{1 + e^{-(w^T x - b)}}    (2.17)

where x is the input vector, and w and b are the weight vector and bias respectively. There is a limitation in using the sigmoid function. When the absolute value of the input increases, the gradient of the sigmoid function with respect to the input shrinks toward zero. This situation is called the vanishing gradient problem, and it makes the learning procedure difficult to converge and reach the minimum.

Rectified linear unit

Another commonly used activation function is the Rectified Linear Unit (ReLU). The ReLU function can be expressed as

f(x) = \max(0, x),    (2.18)

where x is a real number [28]. The graph of ReLU is shown in Figure 2.11. The graph shows that if the input of ReLU is less than zero, the output is zero; otherwise, the output equals the input. For positive inputs, the gradient of ReLU with respect to its input is a constant value (one), which makes it possible to avoid the vanishing gradient problem [23].

Softmax function

Another important activation function is the softmax function. This function normalizes an arbitrary real-valued K-dimensional vector into a K-dimensional vector where the value of each element is between 0 and 1 and the sum of all elements equals 1 [11]. The output of the softmax function can be considered a categorical distribution representing the probability of each element's class. The softmax function [23] is defined as

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K,    (2.19)

where z and \sigma(z) are K-dimensional vectors representing the input and output of this function respectively, z_j is the jth element of the input vector z, and \sigma(z)_j is the jth element of the output vector. Usually, softmax is used as the activation function

Figure 2.11: Graph of the ReLU function (red) [28].

of the last layer in a neural network. For a classification task, K equals the number of classes in the dataset. Therefore, each element in the output vector represents one class, and the corresponding value is the probability that the input data belongs to this class.

2.2.6 Regularization

Regularization is a kind of modification made to the learning algorithm or error function in order to reduce the generalization error of a model [26]. In machine learning, the algorithm is designed to learn general patterns or regularities from the training data [11]. Then, according to the learnt model, predictions are made on future data. Sometimes, however, the model fits so closely to the training data that it learns patterns that exist in the training data only [14]. This problem is called overfitting, and it usually causes poor generalization ability. It typically happens when an overly complicated model is used to explain a dataset with simple patterns. On the contrary, underfitting happens when the algorithm does not learn the general patterns well enough. Both overfitting and underfitting result in poor accuracy and generalization when predicting on a new dataset. Several techniques have been developed to prevent these issues.
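Looking back at the activation functions of the preceding subsections, equations (2.16), (2.18) and (2.19) can be sketched in Python as follows (the function names are mine, and the max-shift inside softmax is a standard numerical-stability trick that is not part of equation (2.19) itself):

```python
import math

def sigmoid(x):
    """Logistic sigmoid (equation 2.16): maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit (equation 2.18)."""
    return max(0.0, x)

def softmax(z):
    """Softmax (equation 2.19): normalizes a K-dimensional vector into
    a probability distribution. Shifting by max(z) leaves the result
    unchanged but avoids overflow in exp for large inputs."""
    m = max(z)
    exps = [math.exp(zj - m) for zj in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # sums to 1; largest input -> largest prob
```

Note that softmax consumes a whole vector while sigmoid and ReLU act element-wise, which is why softmax is the natural choice for the final classification layer.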

L2 regularization

L2 regularization, also known as weight decay, is a technique used to improve the generalization of a model by limiting the values of the weights; it does this by adding an L2 regularization term to the cost function [23]. The total cost function with an L2 regularization term, and the corresponding update rule [23], can be expressed as

C_{total} = C + \frac{\lambda}{2m} \|w\|_2^2    (2.20)

w_i^{t+1} = w_i^t - \alpha \frac{\partial C_{total}}{\partial w_i^t} = w_i^t - \alpha \frac{\partial}{\partial w_i^t} \left( C + \frac{\lambda}{2m} \|w\|_2^2 \right)    (2.21)

In equations (2.20) and (2.21), C is the cost function; \lambda is a hyper-parameter used to control the effect of the L2 term, usually a small real number between 0 and 1; and w is the weight vector consisting of all the weights in the model, of which there are m in total. C_{total} is the total cost function, equal to the sum of the cost function and the L2 term, and \|w\|_2 is the L2 norm of the weight vector w [26]. Equation (2.21) shows the rule to update the weight w_i^t, the ith weight in iteration t. As discussed earlier, the purpose of a learning algorithm is to find a vector w that minimizes the cost, which in this case is the total cost C_{total}. The additional cost, the L2 term, pushes the weight vector toward the origin (zero) in order to minimize the total cost. Research [24] [23] has shown that the introduction of the L2 term leads to an improvement in a model's generalization. There is another regularization method, closely related to L2 regularization, used in [24]. This method adds the weight decay term directly to the equation for updating the weights instead of adding it to the cost function. It [24] can be expressed as

w_i^{t+1} = w_i^t - \alpha \left( \frac{\partial C}{\partial w_i^t} + \lambda w_i^t \right)    (2.22)

where w_i is the ith weight, t is the number of the weight-update iteration, \alpha and \lambda are the learning rate and weight decay rate respectively, and \partial C / \partial w_i^t is the gradient of the cost C with respect to w_i^t.
In fact, both methods have the same effect on the weights, except for a small difference in the coefficients of the two regularization terms.
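The weight-decay form of the update, equation (2.22), can be sketched as follows (the function name and the numeric values in the example are my own illustration):

```python
def l2_update(w, grad, alpha, lam):
    """Weight-decay update of equation (2.22):
    w_i <- w_i - alpha * (dC/dw_i + lambda * w_i)."""
    return [wi - alpha * (gi + lam * wi) for wi, gi in zip(w, grad)]

# With a zero cost gradient, the L2 term alone shrinks every weight
# toward the origin by a factor of (1 - alpha * lambda) per step.
w = [1.0, -2.0]
w = l2_update(w, grad=[0.0, 0.0], alpha=0.1, lam=0.5)
```

The example isolates the decay effect: even when the data gradient vanishes, each weight is pulled a little closer to zero, which is the mechanism that keeps the weight vector small.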

Figure 2.12: Demonstration of the Dropout method.

Dropout

Dropout, another regularization method, is also designed to improve generalization performance. It is implemented by temporarily discarding a small portion of the neurons in a neural network during each training iteration [24]. That means the architecture of the network differs from one iteration to another. Figure 2.12 shows how the dropout method works. In each training iteration, some of the nodes in the network are dropped randomly, and the rest, with their corresponding weights, are used and updated in that iteration. The nodes marked with a cross in Figure 2.12 represent the dropped nodes. One explanation for why dropout works is that dropping neurons randomly during training is like training many subnetworks with different structures [26]. Each subnetwork overfits the data in a different way, and at the same time the model becomes more robust to the loss of some neurons. Therefore, the output of the model is like a mixture of the outputs of different networks, which helps to avoid overfitting [23]. This explanation is similar to the principle of the random forest algorithm, which builds a large group of decision trees using randomly selected data and features for each tree [14]. That is probably why the dropout method works well in neural networks.
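A training-time dropout mask can be sketched as follows. This is the "inverted dropout" variant, in which the surviving activations are rescaled so their expected value is unchanged; the rescaling is a common implementation detail that goes beyond the text, and the function name is my own:

```python
import random

def dropout(activations, p_drop, rng=random):
    """Inverted dropout: zero each activation with probability p_drop,
    and scale the survivors by 1 / (1 - p_drop) so that the expected
    value of each unit's output is unchanged."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0] * 1000, p_drop=0.5)
dropped = sum(1 for a in out if a == 0.0)   # roughly half the units
```

At test time no units are dropped; thanks to the rescaling during training, the full network can then be used as-is, approximating the average of the many thinned subnetworks.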

Figure 2.13: Difference between a standard NN (left) and a CNN (right).

2.2.7 Convolutional neural network

Convolution layer

The Convolutional Neural Network (CNN) is a specific kind of neural network. CNNs were first developed in the 1980s and have risen again in recent years thanks to improvements in computational capability, especially the use of powerful GPUs [11]. A CNN is very similar to a standard neural network. In a standard neural network, the nodes in neighbouring layers are fully connected to each other, whereas in a CNN the nodes are only partially connected [23]. Figure 2.13 shows the difference in the connection pattern between a CNN and a standard NN. As we can see, the weights and connections in a CNN exist only between some of the nodes. In a CNN, the weights in the same layer are grouped into different sets called kernels or filters, and normally more than one kernel is used in each layer. Figure 2.14 demonstrates the CNN computation in a graphical and intuitive way. The first matrix is the input; usually it is an image whose values represent pixel intensities. The second is a kernel whose values represent the weights. The third matrix is the Hadamard product of the kernel and the corresponding area (blue) of the input [23]. The sum of the elements of the third matrix is one element of the fourth matrix, known as the feature map. As the kernel slides over the input step by step in an overlapping way, different sub-areas of the input are involved in the operation. In the end, the feature map is filled with the results, as shown in Figure 2.14(b). This procedure closely resembles the mathematical operation called convolution.

Figure 2.14: Computation process of a CNN. (a) The first element of the result. (b) The last element of the result.

This is the reason why this kind of neural network is called a convolutional neural network [23]. The use of convolutional layers and kernels is inspired by the kernels used in signal processing and computer vision for signal filtering, edge and corner detection and so on [11]. Another difference between a CNN and a standard NN lies in the use of the weights. In a standard NN, each weight except the bias corresponds to one node only [23]. In a CNN, the weights in the same kernel are shared during the convolution operation [26]. That means the weights used to produce the results in the same feature map remain the same; therefore, the number of feature maps equals the number of kernels. In a CNN, there is usually an activation function after each layer to introduce non-linearity into the network. A further difference between a CNN and a standard neural network is the use of pooling layers; usually, there is no pooling layer in a standard neural network.
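The sliding-window operation of Figure 2.14 can be sketched on plain Python lists (the function name is mine, and the 2×2 averaging kernel is an illustrative example; strictly speaking this is cross-correlation, which is what deep learning frameworks also compute under the name "convolution"):

```python
def convolve2d(image, kernel):
    """Valid 2-D convolution as in Figure 2.14: each output element is
    the sum of the elementwise (Hadamard) product of the kernel and one
    sub-area of the input, with the kernel sliding in overlapping steps."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 3x3 input and a 2x2 averaging kernel produce a 2x2 feature map.
fmap = convolve2d([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]], [[0.25, 0.25],
                                [0.25, 0.25]])
```

Note that the same four kernel weights produce every element of the feature map, which is exactly the weight sharing described above.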

Pooling layer

The pooling layer is used to summarize the information from the output of the previous layer, usually the activation layer [26]. It combines the values in a small neighbourhood of the output into a single neuron, which becomes part of the input of the next layer. In some sense, a pooling layer can be considered another kind of kernel. However, its operation finds a representative value of a small area in a non-overlapping way instead of performing a convolution. Another function of the pooling layer is to reduce the computational complexity by reducing the number of neurons [29]. For a 2×2 max pooling or average pooling, only one neuron per window is left after the pooling operation, which means 75% of the data is eliminated. There are two commonly used pooling methods: max pooling and average pooling. Max pooling uses the maximum value in the pooling area as the output, whereas average pooling uses the average value. Figure 2.15 depicts the max pooling and average pooling methods. The activation function in a CNN is usually put before the pooling layer [26], although there is no strict rule on this. When max pooling and ReLU are used in a CNN, the results remain the same regardless of whether the activation function is applied before or after the pooling layer.

2.2.8 Data augmentation

The amount of data is key to training a neural network: the more data is available for training, the higher the accuracy that can be obtained in generalization [24]. However, the performance of a machine learning model is sometimes constrained by a small dataset. In the medical area, due to privacy concerns, usually only a limited number of instances is publicly available. Therefore, a method is needed to increase the amount of data based on the existing data. This kind of technique is called data augmentation.
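Returning briefly to pooling, the non-overlapping 2×2 max and average pooling of Figure 2.15 can be sketched as follows (the function names and the example feature map are my own illustration, on plain Python lists rather than an optimized implementation):

```python
def mean(values):
    """Average of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def pool2d(fmap, size, op=max):
    """Non-overlapping pooling over size x size windows: op=max gives
    max pooling, op=mean gives average pooling."""
    return [[op(fmap[i + a][j + b]
                for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]), size)]
            for i in range(0, len(fmap), size)]

fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 2, 1, 0],
        [3, 4, 5, 6]]
maxed = pool2d(fmap, 2)           # [[7, 8], [9, 6]]
avged = pool2d(fmap, 2, op=mean)  # [[4.0, 5.0], [4.5, 3.0]]
```

The 4×4 map shrinks to 2×2, so three quarters of the values are discarded, matching the 75% reduction mentioned above.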
For image classification tasks, several augmentation techniques have been developed and applied, and they result in good performance [24]. One kind of data augmentation technique changes the intensities of the original images, for example by adding noise or changing the lighting condition; Figure 2.16 demonstrates these methods. Another kind applies deformations to the original images. Commonly used transformations include flips, rotations and scaling. Figure 2.17 gives a few examples of these methods.
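Two of the deformation-based augmentations, the mirror flip and the 90-degree rotation, can be sketched on an image stored as a nested list of pixel intensities (the function names and the tiny 2×2 example are my own illustration):

```python
def flip_horizontal(image):
    """Mirror flip: reverse each row of pixel intensities."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the image clockwise by 90 degrees: reverse the row order,
    then transpose rows and columns."""
    return [list(row) for row in zip(*image[::-1])]

image = [[1, 2],
         [3, 4]]
flipped = flip_horizontal(image)   # [[2, 1], [4, 3]]
rotated = rotate_90(image)         # [[3, 1], [4, 2]]
```

Each transformed copy keeps the original label, so a single scan can contribute several distinct training examples, which is precisely what makes augmentation useful for small medical datasets.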

Figure 2.15: Demonstration of max pooling (left) and average pooling (right).

Figure 2.16: Examples of data augmentation techniques: original image (left), the image with random noise (middle) and with a different lighting condition (right).

Figure 2.17: Examples of data augmentation methods. (a) Flip: original image (left), flipped horizontally (middle), also known as a mirror flip, and vertically (right). (b) Rotation: original image (first), rotated clockwise by 90 degrees (second), 180 degrees (third) and 270 degrees (fourth). (c) Scaling: original image (left), scaled by 10% (middle) and 20% (right).

Figure 2.18: Random splitting method [14].

2.2.9 Model selection

Cross-validation

Usually, a given task can be solved using different learning algorithms, which can generate several models. In order to choose the best model, model evaluation and selection are needed. Model selection is the process of picking the optimal model from a group of candidate models based on their performance, and model evaluation gives an estimate of the future generalization error [14]. The best way to evaluate a model is to test it on an unseen dataset; it is unfair and unreasonable to use the training error to evaluate the model, since the model was trained on the training dataset. Three techniques can be used for model selection, namely random splitting, K-fold cross-validation and full cross-validation [14]. As Figure 2.18 shows, for supervised learning, randomly splitting the dataset into two parts, called training and testing data, is one method for model selection. In this method, the model is built on the training data first, and then the unseen testing data is used to evaluate it. The testing error can be used as a criterion for model selection. However, this method has a limitation. Because the dataset is split into training and testing sets randomly, the same model could produce different results in two trials purely by chance. Therefore, this method is not a good estimate of the future generalization error of the model [14]. Model selection via cross-validation is a better method than random dataset splitting. Cross-validation, also known as rotation estimation, is a group of techniques used to assess the generalization performance of models [14]. These techniques

Figure 2.19: Data splitting for cross-validation [14].

include K-fold cross-validation, Leave-One-Out Cross-Validation (LOOCV) and full cross-validation. The basic idea is to split the samples into several complementary subsets, as shown in Figure 2.19. Each subset is called a fold, and K-fold means that there are K disjoint subsets in total. In K-fold cross-validation, the model is trained on all folds except one and tested on the held-out fold, in rotation [30]. That means each fold is used for testing in turn while the rest are used to train the model. Finally, the average of the testing errors over all folds is the overall cross-validation error. This testing error is a good estimate of the model's generalization performance. LOOCV follows the procedure of basic cross-validation, but there is only one sample in each fold [30]. It is quite compute-intensive and usually suitable only for small datasets. In full cross-validation, the dataset is first partitioned into two parts [14]. Then, K-fold cross-validation is performed on one part, the major part of the dataset, for parameter tuning and model selection. Finally, testing is performed separately on the other part. This process ensures that the testing dataset remains unseen during parameter tuning and is used only for estimating the testing error. Figure 2.20 shows the process of this method.

Receiver operator characteristics

Another technique, the Receiver Operator Characteristics (ROC) graph, together with its related metrics, provides a more sophisticated way to evaluate the performance of a model [14]. As described in Section 2.2.1, performance is often estimated based on the classification error or classification accuracy. However, classification error is a weak metric for measuring the performance of a model [31] when the class distribution is skewed or the costs of misclassification are unequal.
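Returning to K-fold cross-validation for a moment, the fold rotation described above can be sketched as follows (the round-robin assignment of indices to folds and the function name are my own illustrative choices; in practice the indices would usually be shuffled first):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds; in each of the k
    rounds one fold is held out as the test set and the remaining
    k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f in range(k) if f != held_out for idx in folds[f]]
        splits.append((train, test))
    return splits

splits = k_fold_indices(n=10, k=5)
# Across the 5 rounds, every example appears in exactly one test fold.
```

Averaging the five test errors produced by these splits gives the overall cross-validation error, which uses every example for testing exactly once.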

Figure 2.20: Procedure of full cross-validation [14].

Figure 2.21: Confusion matrix for the classification problem with two classes [31].

Table 2.2: Metrics derived from the confusion matrix [31].

Recall = TP / P. Also known as the true positive rate or sensitivity; the percentage of all positive instances that are correctly identified.

False Positive rate (FP rate) = FP / N. Also known as the false alarm rate; the percentage of negative instances that are incorrectly identified.

Specificity = TN / (FP + TN) = TN / N. The percentage of negative instances that are correctly identified; equal to 1 - FP rate.

Precision = TP / (TP + FP). The percentage of instances classified as positive that are actually positive.

Accuracy = (TP + TN) / (P + N). The percentage of all instances that are classified correctly.

F-measure = (2 x precision x recall) / (precision + recall). A higher recall can come at the cost of a lower precision, for example by predicting every instance as positive, and vice versa; the F-measure considers both precision and recall and represents a trade-off between them that avoids such extreme cases.
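The metrics of Table 2.2 can be computed directly from the four cells of the confusion matrix (the function name, the dictionary layout and the example counts are my own illustration):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Metrics of Table 2.2, derived from the two-class confusion matrix."""
    recall = tp / (tp + fn)                 # true positive rate / sensitivity
    fp_rate = fp / (fp + tn)                # false alarm rate
    specificity = tn / (fp + tn)            # equals 1 - FP rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "fp_rate": fp_rate,
            "specificity": specificity, "precision": precision,
            "accuracy": accuracy, "f_measure": f_measure}

# An imbalanced example: 100 instances, only 13 of them positive.
# Accuracy looks high (0.93) while recall is a more modest 8/13.
m = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
```

The gap between the accuracy and the recall in this example is exactly the kind of distortion, caused by a skewed class distribution, that motivates the F-measure and the ROC analysis discussed next.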

Figure 2.22: A basic ROC graph [31].

Figure 2.21 gives the confusion matrix for a two-class classification problem. A two-by-two confusion matrix, also known as a contingency table, is built from the predicted and actual labels. This matrix serves as the basis for many other performance metrics, which are shown in Table 2.2. For example, for a rare disease, if the classifier predicts every instance as negative, it still achieves an apparently acceptable classification accuracy. The reason is that the positive instances account for a much smaller portion of the whole dataset than the negative ones. Usually, the cost of failing to identify the disease is much higher than that of raising a false alarm. Therefore, in some cases, classification accuracy is a weak metric for measuring the performance of a model. The F-measure in Table 2.2, the harmonic average of precision and recall, can be used to measure the performance of such models [31]. It considers both the precision and the recall of the model and reduces the bias caused by an imbalanced class distribution. A basic ROC graph is a two-dimensional graph where the x-axis represents the False Positive (FP) rate and the y-axis represents the True Positive (TP) rate [31]. Figure 2.22 shows an example of an ROC graph. A model with a given pair of metrics (TP rate, FP rate) corresponds to one data point in the figure. The data point D in the top-left corner has a higher

TP rate and a lower FP rate than the other points (A, B, C, E). Any point (classifier) on the diagonal performs no better than random guessing and provides no information at all, like point C. Usually, the point (classifier) closest to the upper-left corner is selected. No classifier should appear in the area below the diagonal, since a classifier in that area performs worse than random guessing; a point below the diagonal could result from swapping the TP rate and FP rate. The choice between points A and B depends on the specific task: if the task pays more attention to the TP rate than to the FP rate, point B is a better choice than point A. With the wide use of machine learning in many areas, especially in medical diagnosis, ROC analysis plays an increasingly important role and usually outperforms classification accuracy in cost-sensitive learning and learning with unbalanced classes [32].

2.3 Current approaches to detect AD using machine learning methods

Several computer-aided diagnostic systems have been built, and some of them have achieved remarkable results. Among these systems, there are mainly three types of methods: voxel-based methods, morphometry-based methods and deep learning methods. The authors of [33] performed a study utilizing the voxel-based approach. The data they used was collected from different centres: 68 scans in total, including 34 subjects with post-mortem confirmation of AD and 34 normal controls (NC). In the pre-processing stage, they first segmented these images into white matter, grey matter and cerebrospinal fluid. Then, a population template was created from these scans, and each grey matter image was normalized to this template. Each normalized grey matter segment was treated as a high-dimensional data point. Finally, a linear Support Vector Machine (SVM) was trained on this data for classification.
The accuracy for the classification of AD vs. NC was between 87.5% and 96.4%, depending on which groups of data were used for training and testing. Besides SVM, other models have also been used for voxel-based classification of MRI images. [5] developed an AD diagnosis system using Principal Component Analysis (PCA) and a neural network. The data used in this research came from the Open Access Series of Imaging Studies (OASIS) MRI database, comprising 457 MRI images. Firstly,

the images were registered into the Talairach atlas space. Then, the normalized centre slices of 230 images were selected as training data. PCA was performed on these 230 images, and the top 150 principal components were retained, so that a 150-dimensional vector represented each image as its feature set. A six-layer neural network with 150 input nodes was trained for classification. The accuracy this research achieved for two-way classification (AD vs. normal control) was 89.22%. [34] performed a series of experiments using dimensionality reduction techniques and machine learning methods. The dataset used in this research is from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The design of this research is divided into two phases. In the pre-processing stage, several dimensionality reduction techniques (PCA, histograms and downscaling) were applied to the MRI images separately. In the second stage, two machine learning methods, a decision tree and a neural network, were trained on these pre-processed images, so the experiment yielded six results in total. For three-way (normal, MCI and AD) classification, the best result was an accuracy of 60.2%, obtained with downscaling and the decision tree. They also conducted three two-way classifications: (NC+MCI) vs. AD, (NC+AD) vs. MCI and (AD+MCI) vs. NC. The accuracy ranged from 52% to 85.8%, depending on the combination of methods used. As these previous studies show, although some of them achieved a high accuracy on their datasets, they also have limitations. The voxel-based approach uses the voxel as the basic feature element, with the voxel's intensity as the feature value. The extracted features of MRI images are treated as data in a high-dimensional space.
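The PCA step used in [5] can be sketched as follows. The 230 slices and 150 retained components follow the text; the input dimensionality (1024) is illustrative:

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA via SVD; returns the data mean and the top-k principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:k]

def pca_transform(X, mean, axes):
    """Project (centred) data onto the retained principal axes."""
    return (X - mean) @ axes.T

# 230 flattened centre slices, reduced to 150-dimensional feature vectors
rng = np.random.default_rng(0)
slices = rng.normal(size=(230, 1024))
mean, axes = pca_fit(slices, k=150)
features = pca_transform(slices, mean, axes)
```

Each row of `features` is the 150-dimensional vector that would be fed to the 150-input neural network described above.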
The machine learning methods are then trained to separate the data into different groups. As discussed in Section 2.1.3, the obvious change in the brain resulting from AD is the shrinkage of various substructures, mainly consisting of grey matter, and the shape, volumetric and structural changes caused by AD are used as biomarkers in clinical diagnosis. However, voxel-based methods neglect these changes, which can result in low accuracy. Moreover, voxel intensities tend to be affected by other factors, such as noise and the scanning machines, so the features used in this kind of approach are very noisy [35]. Furthermore, most of these studies introduce dimensionality reduction techniques, which usually cause a loss of information. Both noisy data

and dimensionality reduction can cause the model to overfit irrelevant noise and reduce accuracy. Another type of method, the morphometry-based method, overcomes the limitation of the voxel-based method by considering morphometric changes in the brain. Using the dataset from ADNI, a team proposed a morphometry-based analysis method [36]. They first corrected the intensities of the MRI images and segmented the grey matter and white matter into separate images. Then, these segmented images were non-linearly registered to MNI space. From the grey matter images, two z-score [37] maps, a grey matter volume z-score map and a cortical thickness z-score map, were created to represent grey matter volume and cortical thickness decline, and were used as classification features. They also calculated the volumes of several brain substructures, such as the hippocampus and the entorhinal cortex. A support vector machine was trained as a classifier on these features. According to their results, the accuracy for classifying AD vs. NC was 92.7%, and for three-way classification the accuracy was 74.6%. Another morphometry-based study was designed and implemented by Henriquez in 2017 [38], using data from ADNI. At the pre-processing stage, the skulls were first removed from the MRI images. Then, the images were linearly registered into MNI space and segmented into grey matter, white matter and cerebrospinal fluid. Finally, using non-linear registration and ROI masks, ten regions of interest were extracted from each brain image, and the volumes of these ROIs were calculated and used as features to train an SVM model. They reached an accuracy of 88.5% for the classification of AD vs. NC.
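The grey matter z-score maps of [36] express, voxel by voxel, how far a subject deviates from the normal-control population. A minimal sketch on toy volumes (shapes and intensity values are illustrative, not real data):

```python
import numpy as np

def zscore_map(subject, nc_stack, eps=1e-6):
    """Per-voxel z-score of one subject's grey-matter map against a stack
    of normal-control maps. nc_stack has shape (n_controls, x, y, z)."""
    mu = nc_stack.mean(axis=0)       # voxel-wise control mean
    sigma = nc_stack.std(axis=0)     # voxel-wise control spread
    return (subject - mu) / (sigma + eps)

rng = np.random.default_rng(0)
controls = rng.normal(loc=1.0, scale=0.1, size=(40, 8, 8, 8))  # toy GM maps
patient = np.full((8, 8, 8), 0.5)   # toy map with uniformly reduced GM
z = zscore_map(patient, controls)
```

Strongly negative z-scores flag grey-matter loss relative to the control population, which is exactly what these maps contribute as classification features.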
As we can see from these studies, the morphometry-based method takes structural information such as volume into account, which conforms to clinical diagnosis and is easy to interpret. However, the complex pre-processing stages make it difficult to implement, and the results depend heavily on the quality of the pre-processing. Moreover, domain knowledge is needed, since the extraction of substructures from a brain image relies on it. Another problem lies in the calculation of the volumes or features of the extracted ROIs: the calculation depends heavily on parameters selected manually by the researcher, and a minor change in parameter selection may cause a large difference in the result. With the rise of deep learning, a new type of method has recently appeared that is able to eliminate the disadvantages of both voxel-based and morphometry-based methods. [29]

conducted a study that used a convolutional neural network to classify MRI scans into different groups. This research used the ADNI dataset, including 2265 MRI scans. In the pre-processing stage, images were first registered to the International Consortium for Brain Mapping template. Then, the images were normalized using zero-mean normalization, i.e. subtracting the population mean and dividing by the standard deviation. In the classification stage, these processed images were first used to train a 3D autoencoder, a neural network trained to reproduce its input at its output. In this network, 150 3D filters/kernels of size 5 × 5 × 5 were trained and obtained. The purpose of this step was to obtain the 150 kernels to be used as the convolutional kernels in the subsequent convolutional network. After training the autoencoder, a network with one convolutional layer and two fully connected layers was built as the classifier; the CNN kernels in this network were the 150 3D kernels trained by the autoencoder. The accuracies of this system for two-way (AD vs. NC) and three-way (AD vs. MCI vs. NC) prediction were 95.39% and 89.47% respectively. Another computer-aided diagnosis system was implemented by [35] using deep learning. The datasets used in this research are from the Computer-Aided Diagnosis of Dementia challenge (CADDementia) and ADNI. There is no pre-processing in this research, which means the input to the system is the raw MRI image. First, a three-layer autoencoder network was built, as in the previous research. However, the purpose of this autoencoder is not to train convolutional kernels but to extract the features from each MRI scan. Eight 3D feature maps were trained and used as features.
Then, a four-layer fully connected neural network was trained using the eight feature maps of each MRI scan as input. In this research, the autoencoder serves as the feature extraction stage, and the fully connected network plays the role of the task-specific classifier. Moreover, the training data for the autoencoder and for the fully connected network came from CADDementia and ADNI respectively, which suggests that this feature extraction method has some ability to generalize. This research represents the state of the art, reaching accuracies of 97.6% and 89.1% for two-way (AD vs. NC) and three-way (AD vs. MCI vs. NC) prediction. In their systems, both of these studies use deep learning methods that include an autoencoder network for feature extraction and a fully connected neural network for classification. Although these two systems achieved the highest accuracy among all types of method, there are

still some problems. The memory required to run these two methods can be huge, especially when the dataset is large, since the raw MRI data is used as input. Moreover, limited interpretability is another drawback: automatic feature extraction is an attractive technique, but it is difficult to tell whether the extracted features are related to AD rather than to some other brain disease, since no AD-related information or labels were used in the feature extraction stage.
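The autoencoder-based feature extraction shared by both systems can be sketched in PyTorch, which is also the deep learning framework used in this project. The architecture below is a toy stand-in: the kernel count, kernel size, depth and volume size are illustrative, not the cited papers' exact configurations:

```python
import torch
import torch.nn as nn

class ConvAutoencoder3D(nn.Module):
    """Toy 3D convolutional autoencoder. After training, the encoder's
    kernels (or its activations) serve as learned features for a
    downstream, label-aware classifier."""
    def __init__(self, n_kernels=8):
        super().__init__()
        # padding=2 with a 5x5x5 kernel preserves the volume size
        self.encoder = nn.Sequential(
            nn.Conv3d(1, n_kernels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.decoder = nn.Conv3d(n_kernels, 1, kernel_size=5, padding=2)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder3D()
x = torch.randn(2, 1, 16, 16, 16)            # batch of toy MRI volumes
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
loss.backward()                              # no labels needed at this stage
```

Note that the reconstruction loss uses no diagnostic labels, which is precisely the source of the interpretability concern raised above: nothing constrains the learned features to be AD-specific.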

Chapter 3

Experimental method

3.1 Research methods

There are mainly three kinds of research methods in Computer Science (CS): theoretical CS, computer engineering and experimental CS [39]; their procedures are shown in Figure 3.1. According to the analysis of the existing research in Section 2.3, it is clear that this kind of research belongs to the computer engineering method shown in Figure 3.1(b) and cannot be proved theoretically or by the analysis of observations. In this case, a system needs to be designed and implemented, and then evaluated by conducting experiments.

3.2 Overall design of the system

According to the analysis of the existing Alzheimer's diagnostic systems discussed in Section 2.3, these systems have some limitations. By identifying and tackling these limitations, it is possible to propose a new method that achieves better performance in terms of accuracy. In our research, the morphometric method is chosen, since this kind of method considers the volumetric and structural changes caused by AD in the brain. We can use these biomarkers, which are also used in clinical practice, as input, which makes the features used for the classification task reliable and understandable. In terms of the classification task, diagnosing AD with machine learning can be treated as an image classification task, since an MRI scan is itself a 3D image. For this kind of task, according to the studies [23] [24], a CNN is able to generate better results than many other machine learning methods. Hence, two phases, pre-processing and classification, are introduced in the proposed

Figure 3.1: Research methods in CS [39]. (a) Theoretical CS. (b) Computer engineering. (c) Experimental CS.

new CAD system. In the first stage, seven substructures/ROIs of the brain affected by AD are extracted into one image as low-level features using the morphometric method: the hippocampus, amygdala, caudate, putamen, pallidum, thalamus and medial temporal lobe. According to [38], these ROIs are discriminative features closely related to AD and are able to yield satisfying results. Then, a CNN is built and trained on these ROIs to classify the MRI scans into the different groups (NC, MCI and AD). Instead of raw MRI images, the neural network is fed with the extracted ROIs only; the size of the network input, and hence the memory requirement, is therefore smaller than for methods that use raw MRI images as input. This new method takes morphometric information into account but also avoids calculating the volumes or features directly.

3.3 Software & Environment

According to the description of the system design in Section 3.2, the output of this project is a system for the diagnosis of Alzheimer's disease. The hardware and software involved in this project, as well as their roles, are discussed in this section.

3.3.1 Hardware

Table 3.1: Primary hardware configurations of the development and test environments.

  Hardware          Development environment       Test environment [40]
  CPU               Intel(R) Xeon(R) E v2         Intel Xeon E v3 2.30GHz
  RAM               64GB                          64GB
  Graphics card     NVIDIA Quadro K600 x 1        NVIDIA Tesla K20X x 4
  Graphics memory   981MB                         5700MB x 4
  Disk              200GB                         600GB

This system is developed and evaluated on different computers, since this is a compute-intensive task. The GPU of our local computer is insufficient for the running

Figure 3.2: Architecture of the HOKUSAI system [40].

of this system; if the system ran on the CPU, it would take weeks to train a model. Table 3.1 shows the hardware configurations of the development and test environments. The system is developed on a computer equipped with an Intel(R) Xeon(R) CPU E v2, 64 GB RAM and one NVIDIA Quadro K600 graphics card. It is then tested and evaluated on the HOKUSAI system, a supercomputer developed by RIKEN [41], the largest research institution in Japan. HOKUSAI consists of several main components: the Massively Parallel Computer, the Application Computing Server, front-end servers and storage, including online storage and hierarchical storage. Figure 3.2 shows the overall architecture of this system. In this project, the Application Computing Server with GPU (ACSG) is used for testing. The ACSG consists of 30 interconnected nodes, which enables massively parallel computing. Each node in the ACSG is equipped with an Intel Xeon E v3 (2.30GHz) CPU, 64GB RAM, a 600GB disk and four NVIDIA Tesla K20X graphics cards. During testing and evaluation, the developer logs into the front-end server and submits computing tasks to the ACSG through the RIKEN network and the High-Performance Network (HPN). The results are kept and stored in the developer's home directory. Since this system does not use the massively parallel computing technique, it is tested and evaluated on one node only.

3.3.2 Software

1. Linux CentOS 7
The operating system of the development and testing environments is Linux CentOS 7. This is an open-source, stable and reproducible platform developed from Red Hat Enterprise Linux (RHEL) [42].

2. Python 3.6
The language used to build this system is Python 3.6. Python is an open-source, interpreted, high-level programming language maintained by the non-profit Python Software Foundation [43]. The power of Python comes partly from its flexibility and the great diversity of its packages.
In this project, a variety of third-party packages are used for the development of this system; these are listed in Table 3.2. The source code of this system can be found in Appendix B.

3. Functional Magnetic Resonance Imaging of the Brain Software Library (FSL)

Table 3.2: List of third-party Python packages used in this project.

  Name       Description
  NumPy      A package for scientific computing [44]. It is used for MRI data
             processing, data augmentation and so on.
  torch      A deep learning framework that enables developers to build a
             neural network quickly and supports the use of GPUs [45]. This
             package plays an important role in developing the neural network
             in this project.
  NiBabel    A package for access to various neuroimaging file formats [46].
             It is used to read and convert MRI data into NumPy arrays in
             this project.

FSL is an open-source, comprehensive library for the analysis and manipulation of a variety of brain imaging data [47]. It consists of many tools, each with a specific function for processing MRI images. In this project, FSL is mainly used for data pre-processing.

4. Robust Brain Extraction (ROBEX)
Robust Brain Extraction (ROBEX) is a command-line tool for whole-brain extraction on T1-weighted MRI data developed by Iglesias in 2011 [48]. It is free of parameter setting and produces reliable results across datasets. This tool is used to strip the skull in the pre-processing stage.

5. CUDA and the CUDA Deep Neural Network library (cuDNN)
CUDA is a parallel computing architecture developed by NVIDIA for programming graphics cards to perform computing on graphical processing units (GPUs) [49]. CUDA enables developers to take advantage of the power of GPUs to greatly speed up computing applications. cuDNN is a task-specific library for deep learning built on CUDA [50]. With its GPU-accelerated functionality, the process of training a deep neural network can be sped up drastically.

6. Git
Git [51] is a distributed version control system that is free of charge and open source. In this project, source code and files are managed using Git.

3.4 Data

3.4.1 Dataset

As discussed in Section 2.1.3, the changes in the brain caused by AD are atrophies of brain substructures. In order to diagnose AD, it is critical to check for brain atrophy without examining the brain tissue directly. In this case, the MRI scan is an ideal medium that is able to display the changes inside the brain in a visible way, which is why most existing CAD systems for AD use MRI/CT scans as input. The data used in this research is from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, which was launched by Dr Michael W. Weiner in 2004 under a public-private partnership. ADNI recruits people, collects data, tracks disease progression and enables the data to be shared between scientists, making great contributions to AD research. All the data in this work is from the first phase of the ADNI study, called ADNI1, which was launched in 2004 with the help of 16 other institutes. The ADNI1 data was collected from 817 subjects, including different numbers of MCI, AD and Normal Control (NC) subjects [2]. There are 1073 MRI images in total in this dataset, which means some subjects have more than one image. The information for this dataset is shown in Tables 3.3 and 3.4.

Table 3.3: Demographic data for the subjects from the ADNI database [2].

  Class   Number   Gender (male/female)   Age (mean ± std)   MMSE (mean ± std)
  AD
  MCI
  NC

Table 3.4: Distribution of the instances [2].

  Class                        No. of instances
  Normal Control               305
  Mild Cognitive Impairment    525
  Alzheimer's Disease          243

Figure 3.3: ICBM/MNI 152 template without skull [47].

3.4.2 Montreal Neurological Institute average brain of 152 scans (MNI152)

The basic structure of the brain is the same among people; however, brains differ greatly in details such as the depth and position of the sulci, and in symmetry. In order to study and locate the substructures of the brain, a standard space, called Talairach space, was defined by Jean Talairach, a French neuroanatomist. Talairach space was built from a single female brain, so it is not representative of the population. In order to define a more representative model of the human brain, the Montreal Neurological Institute (MNI) created a new coordinate system and a series of templates, such as MNI 305 and MNI 152, by averaging many healthy MRI scans [52]. Usually, the MRI images collected from different people or generated by different machines differ from each other, because the position of the brain in the image, the scanning parameters and the voxel size are different. Therefore, it is impossible to compare these scans in their native spaces; moreover, the features extracted from such scans are not valuable for statistical analysis or machine learning classification either. In this case, a standard template is necessary to address the problem: raw MRI images are registered to the standard template, so that the scans become comparable. The template used in this project is MNI 152, the average brain of 152 structural MRI scans from healthy young adults. This template is approximately matched to Talairach space. It is one of the most commonly used templates and is maintained by the McGill University Health Centre. Moreover, MNI 152 was also adopted by the International Consortium for Brain Mapping (ICBM), which is why the template is officially called ICBM 152. Figure 3.3 shows the ICBM 152 template viewed in sagittal, coronal and axial sections.

Figure 3.4: Harvard-Oxford subcortical (left) and cortical (right) structural atlases.

3.4.3 Harvard-Oxford cortical and subcortical structural atlases

In this project, the structural atlases used to extract the substructures of the brain are the Harvard-Oxford cortical and subcortical structural atlases included in FSL [47]. These probabilistic atlases contain a total of 69 cortical and subcortical structural areas [53] and derive from T1-weighted structural MRI scans of 37 healthy subjects. To create the atlases, the Harvard Center for Morphometric Analysis (CMA) segmented these T1-weighted scans one by one using semi-automated tools [53]. Then, affine transformations were applied to register the T1-weighted images to the MNI 152 template. Finally, population probability maps were created for each ROI by combining across subjects. A probabilistic atlas represents the probability, ranging from 0% to 100%, that a voxel belongs to a specific substructure. Figure 3.4 shows an example of the atlases in MNI space included in the FSL toolkit [53]. The use of probabilistic atlases is a kind of trade-off: it is a great challenge to create a set of atlases representing the entire human population, owing to the structural and functional variation between human brains, so probabilistic atlases are a practical way to represent the brain across age and gender.

3.5 Method

3.5.1 Brain extraction

In data pre-processing, skull stripping is first applied to the MRI images to extract the brain from the original image. Since this project focuses on the study of brain tissue, irrelevant parts such as the skull, neck and eyeballs need to be removed. Robust Brain Extraction (ROBEX) is a command-line tool for whole-brain extraction on T1-weighted MRI data developed by Iglesias in 2011 [48]. It adopts a combination of generative and discriminative models to complete the segmentation task. It is an automatic skull-stripping tool with no parameter setting, and it provides robust results across datasets. Another widely used brain extraction tool is the FSL Brain Extraction Tool (FSL-BET) [47]. FSL-BET is a module of the FSL toolkit, which is an analysis suite for MRI and Diffusion Tensor Imaging (DTI) brain images. This tool offers both a Graphical User Interface (GUI) and a command-line interface, which makes it convenient to use. However, it is difficult to get accurate results with it, since several parameters have to be set, requiring expert skill and patience [47]. In summary, ROBEX is parameter-free and can be used across datasets without tuning, whereas FSL-BET, although a powerful brain extraction tool, is harder to use because of its hyper-parameters and the skill required. Moreover, according to [48], ROBEX outperforms FSL-BET on several datasets and is more robust than its rival: ROBEX produces highly accurate results across datasets, while FSL-BET cannot produce comparable results without parameter tuning. Therefore, ROBEX is selected to perform the skull-stripping task in this project. Figure 3.5 shows an image before and after brain extraction.

3.5.2 Linear registration

The raw MRI scans are not comparable in native space, since the voxel size and the position of the brain in the image differ.
In order to generate meaningful features for machine learning or statistical analysis, further processing is needed to register all the images to a standard template, MNI 152 in this project. Registration transforms the different MRI scans into the standard coordinate system. After registration, the voxel size, brain position and scale of the images are uniform, and the features extracted from them are comparable and meaningful.
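In this project, both skull stripping and registration are run with external command-line tools (ROBEX, then FSL's FLIRT registration tool, whose options are described below), so a thin Python wrapper might look like the following sketch. The ROBEX invocation name and the file-naming convention are assumptions that depend on the local installation:

```python
import subprocess

def robex_cmd(t1_path, stripped_path):
    """Skull stripping with ROBEX (the launcher name may differ per install)."""
    return ["runROBEX.sh", t1_path, stripped_path]

def flirt_cmd(stripped_path, template_path, registered_path):
    """Affine registration to the MNI 152 template with FLIRT."""
    return ["flirt",
            "-in", stripped_path,
            "-ref", template_path,
            "-out", registered_path,
            "-dof", "12",
            "-cost", "corratio",
            "-bins", "256"]

def preprocess(t1_path, template_path, run=subprocess.run):
    """Run skull stripping, then registration; returns the registered path.
    `run` is injectable so the pipeline can be tested without FSL installed."""
    stripped = t1_path.replace(".nii.gz", "_brain.nii.gz")
    registered = t1_path.replace(".nii.gz", "_mni.nii.gz")
    run(robex_cmd(t1_path, stripped), check=True)
    run(flirt_cmd(stripped, template_path, registered), check=True)
    return registered
```

In practice this would be called once per scan, e.g. `preprocess("subject01.nii.gz", "MNI152_T1_brain.nii.gz")`, with the paths adapted to the local data layout.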

Figure 3.5: An MRI scan before (upper) and after (lower) brain extraction.

FMRIB's Linear Image Registration Tool (FLIRT) [54] is a module of FSL for automatic structural MRI brain image registration. This tool performs linear registration using affine transformations such as rotation, scaling and translation, which keep the shape and proportions of the brain the same before and after the transformation. Affine transformations are able to register the global differences between images, such as overall size. According to the recommendation from FSL, FLIRT with 12 degrees of freedom is the appropriate choice for the affine transformation [47]. Each MRI scan is normalized and registered to the target standard space of MNI 152; after registration, all scans are in the same space and comparable to each other. The options used in this process and their explanations [54] are listed as follows:

-in: the input MRI image.
-ref: the reference, here the path of the MNI 152 template image.
-out: the output volume, i.e. the name of the output file.
-dof: 12 (the degrees of freedom used in the affine process.)
-cost: corratio (defines the intensity-based cost function used to quantify similarity.)
-bins: 256 (specifies the number of bins in the intensity histogram.)

Figure 3.6 shows two MRI images before and after affine registration. The raw images (a) are not in the same space, so it is impossible to compare them. After registration, the normalized images (b) are the same as the

MNI 152 standard template (c) in size, shape and pose, which enables the comparison between these normalized images.

3.5.3 Segmentation

In this project, segmentation is the process of dividing a 3D MRI image of the brain into different parts according to tissue type, namely Grey Matter (GM), White Matter (WM) and Cerebrospinal Fluid (CSF). Since AD leads to atrophy of substructures composed of grey matter, we can focus on the grey matter and treat the other parts as noise; it is therefore reasonable to remove the white matter and CSF and study the grey matter only. Two widely used tools are available for this task: Statistical Parametric Mapping (SPM) [55] and FSL. FSL can segment an MRI image into GM, WM and CSF using its sub-module FMRIB's Automated Segmentation Tool (FSL-FAST). FSL-FAST uses the Expectation-Maximization algorithm and a hidden Markov random field model to perform the segmentation task [56]. It offers both a GUI and a command-line program, and its input should be an image with the skull already stripped. SPM is a MATLAB software package for the analysis of several types of brain image, such as fMRI, PET and MEG. In SPM, a modified Gaussian mixture model and Bayes' rule are used to implement the segmentation function [57]. Using a Gaussian mixture model, the probability of a voxel belonging to a specific class can be calculated from the intensity distribution of the image; Bayes' rule is then used to produce the joint posterior probability that determines the correct class for the voxel. Both FSL-FAST and SPM are free of charge under the terms of the GNU licence. However, since SPM is a suite for MATLAB, a MATLAB licence is required, and that is costly. Moreover, according to [58], for grey matter segmentation the result produced by FSL-FAST is better than that produced by SPM.
Therefore, FSL-FAST is selected as the tool for brain image segmentation. Figure 3.7 shows the segmentation results generated by FSL-FAST. The following list shows the parameters used in the FSL-FAST command fast <options> <input>:

-t <t>: 1 (indicates the type of MRI scan used; t=1 for T1-weighted and t=2 for T2-weighted images.)
-n <n>: 3 (the number of classes, GM, WM and CSF, for segmentation.)

Figure 3.6: Demonstration of affine registration. (a) Two MRI images after skull stripping. (b) The corresponding images after affine registration. (c) The standard MNI 152 template after skull stripping [47].

Figure 3.7: Examples of the segmentation process. (a) Grey matter. (b) White matter. (c) CSF.

-H <n>: 0.1 (controls the spatial smoothness in the main segmentation phase; a higher value gives a spatially smoother result.)
-l <n>: 20 (bias field smoothing parameter.)
-I <n>: 8 (number of main-loop iterations.)
-o <path>: the path and name of the output.
-B: output the restored image.
-b: output the estimated bias field if this option is given.
<input>: the path of the input image.
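The hard segmentation produced by FAST labels every voxel with a tissue class, so a tissue's volume follows directly from a voxel count times the voxel volume. A minimal sketch on a toy label image; the label convention assumed here (0 = background, 1 = CSF, 2 = GM, 3 = WM) should be verified against the actual FAST output before use:

```python
import numpy as np

def tissue_volume(seg, label, voxel_vol_mm3=1.0):
    """Volume of one tissue class in a hard-segmentation label image.
    seg: integer label array; label: the class to measure;
    voxel_vol_mm3: physical volume of one voxel (1 mm isotropic here)."""
    return int(np.count_nonzero(seg == label)) * voxel_vol_mm3

# Toy 4x4x4 label image: the first two planes are grey matter (label 2)
seg = np.zeros((4, 4, 4), dtype=np.int16)
seg[:2] = 2
gm_mm3 = tissue_volume(seg, label=2)
```

This kind of count is what morphometry-based systems (Section 2.3) compute per substructure; in this project, by contrast, the segmented grey matter image itself is passed on to ROI extraction rather than being reduced to a volume.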

3.5.4 ROI extraction

The last stage of pre-processing is ROI extraction. In this stage, several substructures of each brain are extracted into one image. This stage consists of two main steps: the production of the ROI mask and the ROI extraction itself. Both steps are performed using fslmaths, an FSL tool for brain image calculations such as addition, subtraction and multiplication, and for various image manipulations such as binarization and thresholding.

Production of ROI mask

The ROI masks are produced from the Harvard-Oxford cortical and subcortical structural atlases discussed earlier. The separate structural images are included in FSL. The first step is to open fsleyes, an FSL tool for visualizing MRI images, in which the user can choose each desired substructure and save it to an image one by one, as shown in Figure 3.8. In this project, according to the design, seven substructures of the brain are extracted, so seven substructure images in total are collected from the fsleyes tool. Then, these separate substructure images are filtered, combined and binarized using the following fslmaths commands:

1. fslmaths <input> -thr <threshold> <output>
This command thresholds the substructure images created from the Harvard-Oxford cortical and subcortical structural atlases. These are probabilistic atlases: each voxel contains the probability of belonging to a specific substructure of the brain. In order to get a precise mask, voxels with a low probability can be removed. [38] suggested that ROIs consisting of voxels with a probability between 0.3 and 1 yield better results. Therefore, the threshold is set to 0.3 in this project, and any voxel with a probability lower than 0.3 is set to zero.

2. fslmaths <input1> -add <input2> <output>
This command combines <input1> and <input2> into one image stored under the name <output>.
By using this command, the different substructure images are merged into one image.

3. fslmaths <input> -bin <output>
The last step in producing the ROI mask is to binarize the image generated in the previous step. The intensity of each voxel in the <input> image is set to

67 3.5. METHOD 67 Figure 3.8: Creation of an image for single substructure

68 3.5. METHOD 68 (a) (b) (c) Figure 3.9: Creation of ROI mask. (a) Example of substructure image (Left Hippocampus). (b) Combination of different substructure images. (c) Binarization of the image. either 1 or 0. If the intensity of a voxel is larger than zero, this intensity will be set to 1. Otherwise, the intensity will be set to 0. The <output> is the ROI mask, which will be used to extract ROI from the MRI images. Figure 3.9 shows the outputs produced by using the above commands in ROI producing phase. Figure 3.9(c) is the ROI mask which is produced by binarizing Figure 3.9(b). The ROI mask image looks bigger than Figure 3.9(b) because there are many nonobvious voxels in the latter image ROI extraction ROI extraction is the last operation in the data pre-processing and performed by using fslmaths tool. In this step, the ROI of each MRI image is produced one by one by using the MRI images and the ROI mask. The ROIs are generated by performing Hadamard product between the ROI mask and MRI scans. This operation is possible since both the MRI images and ROI mask have the same dimension and they are in the same space. In an ROI mask image, the intensities of the voxels belonging to the desired

69 3.5. METHOD 69 (a) (b) (c) Figure 3.10: Demonstration of ROI extraction. (a) An ROI mask. (b) An MRI image (grey matter) after linear transformation. (c) Extracted ROI. substructures is one, and the rest is zero. By multiplying the ROI mask and an MRI image, only the voxels belonging to the desired substructures in the MRI image remain the same, and the others become zero. Figure 3.10 shows an example of this procedure, and the command used to perform this operation is fslmaths <input1> -mul <input2> <output> <input1> is usually the MRI image used for training and testing. <input2> can be a real number or the data in the same dimension with <input1>. In this project, <input2> is the ROI mask. -mul option specifies the multiplication operation which performs the Hadamard product between <input1> and <input2>. <output> indicates the name and path of the output.
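The three mask-production commands compute simple voxel-wise operations. As a rough illustration only (not the FSL implementation), the same threshold-combine-binarize sequence can be sketched in NumPy on toy 2x2 arrays; the probability values and the make_roi_mask helper are invented for the example:

```python
import numpy as np

def make_roi_mask(prob_maps, threshold=0.3):
    """Sketch of the three fslmaths steps: -thr, -add, -bin.

    prob_maps: list of arrays, each holding per-voxel probabilities
    (0..1) of belonging to one substructure, as in the probabilistic
    Harvard-Oxford atlases.
    """
    combined = np.zeros_like(prob_maps[0], dtype=float)
    for p in prob_maps:
        # fslmaths -thr: zero any voxel below the threshold
        thresholded = np.where(p >= threshold, p, 0.0)
        # fslmaths -add: merge the substructure images voxel-wise
        combined += thresholded
    # fslmaths -bin: any non-zero voxel becomes 1
    return (combined > 0).astype(np.uint8)

# Two toy "substructure" probability maps
a = np.array([[0.9, 0.2], [0.0, 0.4]])
b = np.array([[0.0, 0.35], [0.1, 0.0]])
mask = make_roi_mask([a, b])
print(mask.tolist())  # [[1, 1], [0, 1]]
```

Note that the 0.2 and 0.1 voxels are removed by thresholding, while any voxel that survives in at least one substructure image ends up as 1 in the mask.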

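Likewise, the extraction step performed by -mul is just an element-wise multiplication. A minimal NumPy sketch, with invented toy intensities standing in for a grey-matter image and extract_roi as a hypothetical helper name:

```python
import numpy as np

def extract_roi(mri, roi_mask):
    """Sketch of `fslmaths <mri> -mul <mask> <out>`: the Hadamard
    (element-wise) product keeps voxels where the mask is 1 and
    zeroes the rest. Both arrays must share the same shape/space."""
    assert mri.shape == roi_mask.shape
    return mri * roi_mask

mri = np.array([[120.0, 80.0], [60.0, 200.0]])    # toy intensities
mask = np.array([[1, 0], [0, 1]], dtype=np.uint8)  # toy ROI mask
roi = extract_roi(mri, mask)
print(roi.tolist())  # [[120.0, 0.0], [0.0, 200.0]]
```

Only the voxels under the mask's ones survive; everything else becomes zero, which is exactly why the MRI image and the mask must already be registered to the same space.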

More information

Unit 8: Introduction to neural networks. Perceptrons

Unit 8: Introduction to neural networks. Perceptrons Unit 8: Introduction to neural networks. Perceptrons D. Balbontín Noval F. J. Martín Mateos J. L. Ruiz Reina A. Riscos Núñez Departamento de Ciencias de la Computación e Inteligencia Artificial Universidad

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Part 8: Neural Networks

Part 8: Neural Networks METU Informatics Institute Min720 Pattern Classification ith Bio-Medical Applications Part 8: Neural Netors - INTRODUCTION: BIOLOGICAL VS. ARTIFICIAL Biological Neural Netors A Neuron: - A nerve cell as

More information

Introduction to Machine Learning HW6

Introduction to Machine Learning HW6 CS 189 Spring 2018 Introduction to Machine Learning HW6 Your self-grade URL is http://eecs189.org/self_grade?question_ids=1_1,1_ 2,2_1,2_2,3_1,3_2,3_3,4_1,4_2,4_3,4_4,4_5,4_6,5_1,5_2,6. This homework is

More information

Neural Networks. Advanced data-mining. Yongdai Kim. Department of Statistics, Seoul National University, South Korea

Neural Networks. Advanced data-mining. Yongdai Kim. Department of Statistics, Seoul National University, South Korea Neural Networks Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea What is Neural Networks? One of supervised learning method using one or more hidden layer.

More information

Lecture 11 Linear regression

Lecture 11 Linear regression Advanced Algorithms Floriano Zini Free University of Bozen-Bolzano Faculty of Computer Science Academic Year 2013-2014 Lecture 11 Linear regression These slides are taken from Andrew Ng, Machine Learning

More information