An Evaluation Framework for the Comparison of Fine-Grained Predictive Models in Health Care

Ward R.J. van Breda 1, Mark Hoogendoorn 1, A.E. Eiben 1, and Matthias Berking 2

1 VU University Amsterdam, Department of Computer Science, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands, {w.r.j.van.breda,m.hoogendoorn,a.e.eiben}@vu.nl
2 Friedrich-Alexander University Erlangen-Nuremberg, Germany, matthias.berking@fau.de

Abstract. Within the domain of health care, more and more fine-grained models are observed that predict the development of specific health (or disease-related) states over time. This is due to the increased use of sensors, allowing for continuous assessment and leading to a sharp increase of data. These specific models are often much more complex than high-level predictive models that e.g. give a general risk score for a disease, making the evaluation of these models far from trivial. In this paper, we present an evaluation framework which is able to score fine-grained temporal models that aim at predicting multiple health states, considering their capability to describe data, their capability to predict, the quality of the model's parameters, and the model complexity.

1 Introduction

Predictive modeling is of utmost importance in health care and has the potential to be the basis for prevention strategies, especially early and highly personalized interventions. Predictive models can vary greatly in their level of granularity, ranging from relatively coarse-grained models that e.g. provide a general disease risk score, to highly fine-grained ones that predict detailed developments of disease and/or disease-relevant states over time. The latter receive more and more attention due to improvements in measurement capabilities. The evaluation of fine-grained models is unfortunately far from trivial, as they are often of a temporal nature, predict multiple states, have parameters with which the model can be personalized, et cetera. In order to make informed decisions during model development, or when comparing models, there is a need to understand how model quality can be evaluated in a rigorous way. In this paper, we present an evaluation framework for fine-grained predictive models. The framework considers the following aspects: 1) the descriptive capability; 2) the predictive capability; 3) parameter sensitivity; and 4) model complexity. The weight of each of these criteria can be set according to the characteristics of the specific disease or health aspects under consideration.
2 Evaluation Framework

Scope of Framework. The framework we describe is meant for domains with temporal data over a number of attributes $a_1, \dots, a_m$ for a number of patients $b_1, \dots, b_n$. Attributes represent an aspect of the health state of a patient, and we denote the domain of $a_i$ by $A_i$. At any given time $t$, the state of a patient $b$ is a vector $s(b,t) \in A_1 \times \dots \times A_m$. To designate the specific value of an attribute we use the notation $s(b,t,i) \in A_i$. We assume a dataset that contains a state for each patient for a number of time instances $t_1, \dots, t_{end}$. These measured data are contained in the matrix $Z$, where at any given time $t_j$ ($j = 1, \dots, end$) the observed state of patient $b$ is a vector $z(b,t_j) \in A_1 \times \dots \times A_m$. Furthermore, we consider models composed from rules for state transitions and assume one rule for each of the given attributes that describes the value of $a_i(t+1)$ based on $a_i(t)$ and possibly some other attributes. Formally, we have $r_i : A_1 \times \dots \times A_m \to A_i$. Obviously, a transition rule for attribute $i$ may not need all other attributes, only a few of them; in the extreme case only $a_i$ itself.

A model $M$ is then a composed entity that can predict the consecutive states of any given patient. Thus, $M : A_1 \times \dots \times A_m \to A_1 \times \dots \times A_m$, and given a state $s(b,t)$ of patient $b$ at time $t$ we can denote the predicted state at time $t+1$ by $M(s(b,t))$. Each model is equipped with parameters; a model instance is a model for which a value has been assigned to each parameter, denoted as $M_p$ where $p \in P$ and $P$ is the set of possible parameter value vectors. To employ a model $M$ for predicting a full sequence of states for a patient $b$, it needs to be applied to the first observed state $z(b,t_1)$ and then iteratively to the outcomes. To avoid an unnecessarily heavy notation, we define these predicted states iteratively as follows. Given a patient $b$ and a model instance $M_p$, the predicted state at the start is the observed state

$$s_{M_p}(b, 1) = z(b, 1)$$

and for all $t = 1, \dots, end - 1$

$$s_{M_p}(b, t+1) = M_p(s_{M_p}(b, t))$$

The goal of the model instance is to minimize the difference between the predicted and observed states. The error of a model instance $M_p$ for a patient $b$ over the full observation period is then

$$e(b, M_p) = \sum_{t=t_1}^{t_{end}} D(z(b,t), s_{M_p}(b,t))$$

where $D$ is some application-specific measure of difference between two states. Furthermore, $e(b, M_p, i)$ denotes the error the model makes on an attribute $i$. Throughout this paper, the Mean Squared Error (MSE) is assumed to be the error measure.
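To make the formalism concrete, a minimal sketch follows (in Python; `simulate`, `attribute_errors`, and `toy_model` are hypothetical names chosen for illustration, as the paper prescribes no implementation). It iterates a model instance over the observation period and computes the per-attribute MSE:

```python
import numpy as np

def simulate(model, z, p):
    """Iterate a model instance M_p over the observation period:
    s_Mp(b, 1) = z(b, 1) and s_Mp(b, t+1) = M_p(s_Mp(b, t))."""
    s = np.empty_like(z)
    s[0] = z[0]
    for t in range(len(z) - 1):
        s[t + 1] = model(s[t], p)
    return s

def attribute_errors(model, z, p):
    """Per-attribute MSE, i.e. e(b, M_p, i) for i = 1..m."""
    s = simulate(model, z, p)
    return ((z - s) ** 2).mean(axis=0)

# Toy model with m = 2 attributes: one transition rule per attribute,
# each pulling its attribute towards the other at a parameterized rate.
def toy_model(state, p):
    x, y = state
    return np.array([x + p[0] * (y - x), y + p[1] * (x - y)])

# Observed states z(b, t_1)..z(b, t_end) for one patient, one row per time step.
z = np.array([[0.2, 0.8], [0.3, 0.7], [0.4, 0.6], [0.45, 0.55]])
print(attribute_errors(toy_model, z, p=np.array([0.5, 0.5])))
```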
Descriptive Capability. To express the quality of a model $M$ for a given patient $b$ we consider the error $e(b, M, a_i)$ on each attribute $a_i$. This implies a multi-objective optimization (MOO) problem, where each objective corresponds to one attribute. In the sequel we assume that we have and use a multi-objective optimization algorithm that searches in the space of model instances $M_p$ based on a training set. For examples of MOO algorithms, see [2], [3]. Given a patient $b$, the output of one run of the algorithm is a set of non-dominated model instances, where dominance is defined in the usual manner: $M_p$ dominates $M_{p'}$ if for each attribute $a_i$ the error $e(b, M_p, i)$ is lower than or equal to $e(b, M_{p'}, i)$, and there is at least one attribute $a_j$ where the error $e(b, M_p, j)$ is lower than $e(b, M_{p'}, j)$. Because each model instance corresponds to one point in the space of model parameter vectors, the set of non-dominated model instances forms a Pareto front in this vector space (whose dimensionality depends on the number of model parameters). Due to the typically stochastic nature of the algorithms used to find such assignments, it is assumed that multiple runs of the algorithm are performed per patient. Each run $r$ delivers a Pareto front of $q$ non-dominated model instances $\{M_{p_1,r}, \dots, M_{p_q,r}\}$, and each model instance $M_{p_i,r}$ has a corresponding $m$-dimensional error vector $(e(b, M_{p_i,r}, 1), \dots, e(b, M_{p_i,r}, m))$, where $m$ is the number of attributes. Each of these $q$ vectors can be plotted using an attainment surface [1]. Given such an attainment surface, its dominated hypervolume can be calculated, which is the volume above the attainment surface relative to an error reference point, set to 1 for each objective, based on the assumption that all error values are scaled to the interval [0,1]. For more details, see [1]. We use the notation $nh_M(b,r)$ to denote the dominated hypervolume for run $r$ of the MOO algorithm optimizing model $M$ for patient $b$. Executing $r_{max}$ runs of the MOO algorithm, we obtain $r_{max}$ attainment surfaces and $r_{max}$ values of $nh_M(b,r)$. Taking all runs into account, we can now define model quality by averaging the values of the non-dominated hypervolumes:

$$nh_M(b) = \frac{\sum_{r=1}^{r_{max}} nh_M(b,r)}{r_{max}}$$

Each patient has its own unique value, and we therefore end up with a vector of $n$ values: $nh_M(b_1), \dots, nh_M(b_n)$. The final score for this criterion is defined as the multiplication of the mean $\mu_{nh_M}$ of the values in this vector with 1 minus their standard deviation $\sigma_{nh_M}$, since a high mean is good in combination with a small standard deviation:

$$descriptive\_score_M = \mu_{nh_M} \cdot (1 - \sigma_{nh_M})$$
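As an illustration of the aggregation step, the sketch below (hypothetical helper names; the hypervolume computation itself, detailed in [1], is assumed to be available and is not reproduced here) filters a set of error vectors down to a Pareto front and turns per-run hypervolumes into the descriptive score:

```python
import numpy as np

def pareto_front(errors):
    """Keep only non-dominated error vectors (rows), using the dominance
    relation above: f dominates e if f <= e everywhere and f < e somewhere."""
    errors = np.asarray(errors)
    keep = [i for i, e in enumerate(errors)
            if not any(np.all(f <= e) and np.any(f < e)
                       for j, f in enumerate(errors) if j != i)]
    return errors[keep]

def descriptive_score(nh):
    """nh: shape (n, r_max), the dominated hypervolumes nh_M(b, r) per
    patient and run (error values assumed scaled to [0, 1] beforehand)."""
    nh_b = nh.mean(axis=1)                   # nh_M(b): average over runs
    return nh_b.mean() * (1.0 - nh_b.std())  # mu_nh * (1 - sigma_nh)

# e.g. 3 patients, 4 MOO runs each
rng = np.random.default_rng(0)
print(descriptive_score(rng.uniform(0.6, 0.9, size=(3, 4))))
```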
Predictive Capability. Next to a good descriptive capability, we also want the model to perform well on the test set, i.e. to have a good predictive capability. Hereto, we define two measurements: 1) the absolute predictive performance on the test set, and 2) the relationship between how well the model performs on the training set and on the test set; we would prefer these to go hand in hand, and call this the relative predictive performance. Concerning absolute predictive performance, for a model $M$ we calculate the mean $\mu_{e_M}(j)$ and the standard deviation $\sigma_{e_M}(j)$ over the set of errors belonging to attribute $j$ for the $q$ model instances per run and $r_{max}$ runs, for all patients (i.e. 1 to $n$), resulting in a set of $q \cdot r_{max} \cdot n$ errors. We then take the average of the mean and of the standard deviation over all $m$ attributes, $\mu_{e_M}$ and $\sigma_{e_M}$. The absolute predictive performance score then becomes:

$$absolute\_pred\_score_M = (1 - \mu_{e_M}) \cdot (1 - \sigma_{e_M})$$

To measure the relation between the errors on the training and test sets, with $q$ model instances per run and $r_{max}$ runs we end up with a total of $q \cdot r_{max}$ model instances with specific parameter vectors per patient $b$. For each attribute $j$ we determine the correlation between the errors of each of the $q \cdot r_{max}$ model instances on the training set and on the test set for each individual patient (whereby the training set is the first period of data and the test set the second, later period): $e_{train}(b, M_{p_i,r}, j)$ and $e_{test}(b, M_{p_i,r}, j)$. Hereto, we use the Pearson correlation (abbreviating training and test to $tr$ and $te$, and writing $\overline{e}_{tr}(b,j)$ and $\overline{e}_{te}(b,j)$ for the averages over all $q \cdot r_{max}$ model instances) for a specific model $M$:

$$cor_M(b,j) = \frac{\sum_{i=1}^{q} \sum_{r=1}^{r_{max}} \big(e_{tr}(b, M_{p_i,r}, j) - \overline{e}_{tr}(b,j)\big)\big(e_{te}(b, M_{p_i,r}, j) - \overline{e}_{te}(b,j)\big)}{\sqrt{\sum_{i=1}^{q} \sum_{r=1}^{r_{max}} \big(e_{tr}(b, M_{p_i,r}, j) - \overline{e}_{tr}(b,j)\big)^2} \cdot \sqrt{\sum_{i=1}^{q} \sum_{r=1}^{r_{max}} \big(e_{te}(b, M_{p_i,r}, j) - \overline{e}_{te}(b,j)\big)^2}}$$

Then, we calculate the average $\mu_{cor_M}$ across the set of all correlations for all patients (i.e. 1 to $n$) and attributes (i.e. 1 to $m$), $cor_M(b_1,a_1), \dots, cor_M(b_n,a_1), \dots, cor_M(b_n,a_m)$, as well as their standard deviation $\sigma_{cor_M}$:

$$relative\_pred\_score_M = \max(\mu_{cor_M}, 0) \cdot (1 - \sigma_{cor_M})$$

Note that with respect to the mean a cutoff point of 0 is used (via the max operator), as we consider all correlations below 0 equally bad. Given weights $w_1$ and $w_2$, a final evaluation score for model $M$ is calculated as:

$$predictive\_score_M = w_1 \cdot absolute\_pred\_score_M + w_2 \cdot relative\_pred\_score_M$$

Parameter Sensitivity. The parameter sensitivity is the most complex metric; in the current version we keep it as simple as possible. We want to avoid meaningless parameters that do not have any influence on the performance of the model. Therefore we look at the relationship between parameters and the various evaluation objectives. Assuming we define $p_{i,r,b}(u)$ as the value of parameter $u$ for model instance $i$ from run $r$ for patient $b$, we define the correlation between a parameter $k$ and the resulting error on an attribute $j$ for model $M$ as follows (with $\overline{p}_b(k)$ the average value of parameter $k$ over all instances for patient $b$):

$$cor_M(b,j,k) = \frac{\sum_{i,r} \big(e_{tr}(b, M_{p_{i,r,b}}, j) - \overline{e}_{tr}(b,j)\big)\big(p_{i,r,b}(k) - \overline{p}_b(k)\big)}{\sqrt{\sum_{i,r} \big(e_{tr}(b, M_{p_{i,r,b}}, j) - \overline{e}_{tr}(b,j)\big)^2} \cdot \sqrt{\sum_{i,r} \big(p_{i,r,b}(k) - \overline{p}_b(k)\big)^2}}$$

If a parameter always has a correlation close to zero (for the different evaluation criteria and patients), this is considered bad. Thus, we look for the maximum of the absolute value of the correlation over the set of all patients and criteria, which indicates whether a parameter has a use (i.e. a correlation) in at least one patient/criterion combination. We define a correlation as useless (or weak) if it falls below a boundary of 0.35 [4]:

$$useful_M(p_k) = \begin{cases} 1 & \text{if } \max_{b \in [1,n],\, j \in [1,m]} |cor_M(b,j,k)| \geq 0.35 \\ 0 & \text{otherwise} \end{cases}$$

Finally, we calculate the fraction of useful parameters in a model as its score for parameter sensitivity, where $|P|$ is the number of elements in the parameter vector (and thus the highest parameter index of the model):

$$sensitivity\_score_M = \frac{\sum_{k=1}^{|P|} useful_M(p_k)}{|P|}$$
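The correlation-based scores map directly onto a standard Pearson correlation. The following sketch (hypothetical function names; numpy's `corrcoef` stands in for the formulas above, and the error and parameter arrays are assumed to have been collected from the MOO runs) computes the relative predictive score, the usefulness indicator, and the sensitivity score:

```python
import numpy as np

def relative_pred_score(e_tr, e_te):
    """e_tr, e_te: shape (n, m, q * r_max) -- training/test errors per
    patient b, attribute j, and model instance. Implements
    max(mu_cor, 0) * (1 - sigma_cor) over all patient/attribute correlations."""
    n, m, _ = e_tr.shape
    cors = np.array([np.corrcoef(e_tr[b, j], e_te[b, j])[0, 1]
                     for b in range(n) for j in range(m)])
    return max(cors.mean(), 0.0) * (1.0 - cors.std())

def useful(e_tr, params, k, threshold=0.35):
    """useful_M(p_k): 1 if parameter k reaches |cor| >= 0.35 [4] with the
    training error for at least one patient/attribute combination, else 0.
    params: shape (n, q * r_max, |P|), the sampled parameter vectors."""
    n, m, _ = e_tr.shape
    best = max(abs(np.corrcoef(params[b, :, k], e_tr[b, j])[0, 1])
               for b in range(n) for j in range(m))
    return int(best >= threshold)

def sensitivity_score(e_tr, params):
    """Fraction of useful parameters, i.e. sum_k useful_M(p_k) / |P|."""
    n_params = params.shape[2]
    return sum(useful(e_tr, params, k) for k in range(n_params)) / n_params
```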
Model Complexity. The model complexity counts the number of states and parameters:

$$complexity\_score_M = 1 - \frac{|A_M| + |P_M|}{\max_{mo \in Models}(|A_{mo}| + |P_{mo}|)}$$

The score is scaled according to the maximum complexity of the models that are subject to evaluation.

3 Discussion

This paper has introduced a framework that is able to evaluate fine-grained temporal predictive models for health care. Hereto, several criteria have been introduced, which can be combined by taking a weighted sum of the different scores, selecting the weights depending on the importance of each criterion for the case at hand. Initial experiments suggest that the framework is able to generate important insights into the properties of the models. For future work we want to test and refine the framework by further investigating the usefulness and performance of the different criteria and related metrics. Furthermore, we want to use the framework as a fitness function for automatically generating predictive dynamic models using genetic programming techniques.

Acknowledgements. This research has been performed in the context of the EU FP7 project E-COMPARED (project number 603098).

References

1. Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms, vol. 16. John Wiley & Sons (2001)
2. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 182-197 (2002)
3. Mlakar, M., Petelin, D., Tušar, T., Filipič, B.: GP-DEMO: Differential evolution for multiobjective optimization based on Gaussian process models. European Journal of Operational Research (2014)
4. Taylor, R.: Interpretation of the correlation coefficient: a basic review. Journal of Diagnostic Medical Sonography 6(1), 35-39 (1990)