
Dipartimento di Informatica e Scienze dell'Informazione

A Statistical Learning Approach to Liver Iron Overload Estimation

Luca Baldassarre 1,2, Barbara Gianesin 1, Annalisa Barla 2 and Mauro Marinelli

1 - DIFI - Università di Genova, v. Dodecaneso 33, 16146, Genova, Italy
2 - DISI - Università di Genova, v. Dodecaneso 35, 16146, Genova, Italy

Technical Report DISI-TR
DISI, Università di Genova, v. Dodecaneso 35, Genova, Italy

A Statistical Learning Approach to Liver Iron Overload Estimation

Luca Baldassarre, Barbara Gianesin, Annalisa Barla and Mauro Marinelli

Abstract

In this work we present and discuss in detail a novel vector-valued regression technique: our approach allows for an all-at-once estimation of the vector components, as opposed to solving a number of independent scalar-valued regression tasks. Despite its general-purpose nature, the method has been designed to solve a delicate medical issue: a reliable and non-invasive assessment of body-iron overload. The Magnetic Iron Detector (MID) measures the magnetic track of a person, which depends on the anthropometric characteristics and the body-iron burden. We aim to provide an estimate of this signal in the absence of iron overload. We show how this question can be formulated as the estimation of a vector-valued function which encompasses the prior knowledge on the shape of the magnetic track. This is accomplished by designing an appropriate vector-valued feature map. We successfully applied the method to a dataset of 84 volunteers.

1 Introduction

Iron is essential to human life, but it is toxic in excessive amounts. There are several diseases characterized by liver iron overload, such as thalassemia or hereditary hemochromatosis. The assessment of body iron excess is therefore essential for managing the therapies for these diseases. Each disease is characterized by a different mechanism of iron accumulation: for thalassemic patients iron overload is induced by periodic blood transfusions, while for hemochromatosis patients it is induced by an incorrect dietary absorption of iron. The invasive liver biopsy is still considered the best way to perform iron overload evaluation but, being a local measure, it is affected by large errors due to the heterogeneous distribution of iron deposition in the liver.

Figure 1.1: Volunteer and patient magnetization signals.
Figure 1.2: Eddy current and magnetization signals of a volunteer.

Recently, Marinelli and colleagues [9] have developed a room-temperature biosusceptometer, the Magnetic Iron Detector (MID), which measures the variation of a magnetic field at different positions along the axis that crosses the patient's liver. This instrument allows the non-invasive assessment of the iron overload in the whole liver. Given an estimate of the background signal of a patient, that is, the signal that would be generated in the absence of iron overload, it is possible to recover the iron burden by subtracting the estimated signal from the measured signal. The statistical model developed by Marinelli and coworkers [8] is currently used at the E.O. Ospedali Galliera Hospital in Genoa, Italy, for assessing the iron overload. The model has been trained on a dataset of 84 healthy volunteers and it estimates the ratio R(x) between the two signals shown in subfig. 1.2, of which only the magnetization signal depends on the liver iron content. The core idea behind their approach is that the magnetization signal of a well-treated patient is indistinguishable from the one generated by a healthy volunteer with the same biometric features, see subfig. 1.1. Furthermore, they assume that the ratio R(x) of the two signals, evaluated only in the range between -8 cm and 8 cm, resembles a parabola. We reformulate this problem in the context of Statistical Learning, presenting a method to transform a curve-fitting task into a vector-valued regression model. Since the measures are always taken at fixed positions along the measurement axis, they can be thought of as components of a vector, and a high correlation among them can be assumed, because they approximately lie on a parabola. In this way we avoid directly estimating the magnetization curve.

Our vector-valued regression model simultaneously estimates the five points of the background signal. The correlation between these points is introduced by an appropriate feature map, which is linear in the biometric features and quadratic in the measurement positions. As we will show in subsec. 3.2, we can compute the corresponding matrix-valued kernel function. The method described in Sec. 3.2 can be implemented by means of iterative algorithms, see [7], such as Landweber, the ν-method or the sparsity-enforcing l1l2 regularization [3; 4].

2 Liver Iron Overload Estimation

The biosusceptometer is composed of an AC magnetic source and a pickup coil which measures the electromotive force (emf) produced by the oscillation of the magnetic field flux, as shown in subfig. 1.3. A body placed between the magnet and the pickup slightly modifies the flux and therefore the measured emf: the amount of the variation depends on the magnetic properties of the body, on its geometry and on its positioning in the field. The emf produced in the pickup by the field is about 4 V and the diamagnetic signal of a body is about 4 µV. A moderate iron overload adds a paramagnetic contribution of about 0.4 µV. The symmetry of the system and the use of synchronous detection make this difficult measurement feasible [8]. A sample of susceptibility χ, placed between the magnet and one of the pickups, generates the signal:

V = \int_{\mathrm{Volume}} χ(\vec{r}) \, g(\vec{r}) \, d\vec{r}    (1)

The weight function g(r) is reported in subfig. 1.4: all body tissues contribute to the signal generation, but the major contribution comes from those between the magnet and the pickup coil (measurement region). The patient lies supine on a stretcher, such that in the measurement region the liver center of mass crosses the magnetic field axis. The magnetic track of the patient is the complete scan of the magnetic properties of the body section; it is composed of measurements of the magnetic signal taken 4 cm apart. Position x = 0 cm corresponds to the center of the body; negative positions indicate the liver side, while positive positions the spleen side.
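
The role of the weight function in Eq. (1) can be illustrated with a small numerical sketch. Everything below (the grid, the susceptibility map χ and the sensitivity g) is a made-up placeholder, not MID calibration data; the point is only that the signal is a g-weighted volume integral dominated by the tissue lying between magnet and pickup.

```python
import numpy as np

# Toy numerical version of Eq. (1): the signal is the susceptibility map weighted by
# the sensitivity g(r) and integrated over the body volume. The grid, chi and g below
# are made-up placeholders, not MID calibration data.
x, y, z = np.meshgrid(np.linspace(-20, 20, 81),       # cm, along the measurement axis
                      np.linspace(-15, 15, 61),
                      np.linspace(-15, 15, 61), indexing="ij")
voxel = 0.5 * 0.5 * 0.5                               # cm^3 per grid cell

chi = -9e-6 * np.ones_like(x)                         # water-like diamagnetic tissue
chi[(x < -4) & (np.abs(y) < 6) & (np.abs(z) < 6)] += 2e-6   # "liver" region with extra iron

g = np.exp(-(y**2 + z**2) / 50.0)                     # sensitivity peaked between magnet and pickup

V = np.sum(chi * g) * voxel                           # discretized Eq. (1)
print(f"simulated signal (arbitrary units): {V:.3e}")
```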

Figure 1.3: Magnet and pickup configuration.
Figure 1.4: Weight function g(r) and its cross sections along directions A and B.
Figure 1: When the patient is positioned between the magnet and the pickup, an emf is produced by the magnetization of the patient's tissues. With this positioning, the whole liver falls in the region where g(r) is largest.

Subfigure 1.1 reports the magnetic tracks of a healthy volunteer and of a patient with similar anthropometric features: the liver iron overload produces an evident variation of the signal in the left part of the track. Due to the oscillating magnetic field, the magnetic signal generated by the human body has two independent sources: the magnetization signal, from the diamagnetic and paramagnetic properties of the tissues, and the eddy current signal, from their electrical conductivity. For each patient a double track is recorded (an example is shown in subfig. 1.2): only the magnetization signal depends on the iron overload. The aim of the present work is to find a model which best approximates the magnetization signal of a healthy volunteer (background signal) from the volunteer's anthropometric data and the eddy current signal. The iron overload can then be evaluated by computing the difference between the measured magnetization signal and the estimate of the background signal. Therefore, increasing the prediction accuracy of the model increases our ability to detect slight iron overloads. However, the maximum accuracy we can obtain is limited by the measurement error, mainly due to the positioning of the patient on the stretcher. This error is about 150 nV and corresponds to an iron overload of 0.4 g.

3 Statistical Learning Approach

3.1 Non-parametric regression and regularization

Given a set of input-output examples z = {(x_1, y_1), ..., (x_n, y_n)}, with x_i ∈ X and y_i ∈ Y, the aim of statistical learning is to find the deterministic function that best represents the relationship between x and y. This function is called the regression function. To tackle this problem it is necessary to define the space of candidate functions (the hypothesis space H) and the measure used to assess the goodness of a candidate (the loss function V). Ideally, one would like to find a function that minimizes the loss function on all possible input-output pairs. Since one always deals with finite sets, one has to resort to minimizing the empirical risk, that is, the loss computed only on the examples z:

E_n(f) = \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)).

If the hypothesis space is large enough to accommodate almost any sensible function, it will always be possible to perfectly predict (or fit) the examples z, without any guarantee that this minimizer will perform well on unseen data. This problem is called overfitting. There are two alternative approaches to avoid overfitting: the first consists in restricting the hypothesis space, the second in favoring smooth functions. It is usually more straightforward to translate our prior information on the nature of the specific problem at hand into properties of the regression function than into a restriction of the hypothesis space; we therefore prefer to follow the second approach. Convenient hypothesis spaces are the Reproducing Kernel Hilbert Spaces (RKHS), which, for specific choices of the kernel function K, are dense in L^2(X). The norm in these spaces measures the regularity of the functions: a smaller norm corresponds to smoother functions. It is therefore natural to search for the functions that minimize the following functional:

\frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + λ \|f\|_K^2    (2)

The regularization parameter λ controls the trade-off between the two terms: as λ approaches zero we obtain the interpolating solution, while as λ grows the solution becomes smoother and smoother. This approach is called Tikhonov regularization.
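
As an illustration of Tikhonov regularization with the square loss, the following minimal sketch (our own toy example, not the MID pipeline: the Gaussian kernel, the data and the value of λ are assumptions made for illustration) solves the linear system that minimizes functional (2) for a fixed kernel, and evaluates the resulting estimator on new points.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# Toy 1-D data: noisy samples of a smooth curve (placeholders for real examples z).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

n, lam = len(X), 1e-2
K = gaussian_kernel(X, X)
# Minimizing (2) with the square loss yields coefficients c solving (K + n*lambda*I) c = y.
c = np.linalg.solve(K + n * lam * np.eye(n), y)

X_new = np.linspace(-3, 3, 7)[:, None]
f_new = gaussian_kernel(X_new, X) @ c        # estimator evaluated on new inputs
print(np.round(f_new, 3))
# Larger lam -> smoother (flatter) estimates; lam -> 0 approaches the interpolating solution.
```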

In a RKHS, the representer theorems [5; 10] guarantee that the solution can always be written as:

f(x) = \sum_{i=1}^{n} K(x, x_i) \, c_i

where the coefficients c_i depend on the data, on the loss function, on the kernel choice and on the regularization parameter λ. Note that, if Y ⊆ ℝ, then the c_i and K(x, x_i) are scalars, whereas if Y ⊆ ℝ^d the c_i are d-dimensional vectors and K(x, x_i) is a d × d matrix. The direct approach to vector-valued regression, as in [10], is computationally expensive since it requires inverting an nd × nd matrix. To overcome this issue, we propose an extension to the vector-valued case of iterative methods originally developed for scalar regression [12; 7]. The main idea of these techniques is to start from an approximate solution and iteratively add a correction in the direction opposite to the gradient of the empirical risk. Letting the iterations go to infinity leads to an overfitted solution, with its problems of stability and generalization; by early stopping the procedure, a regularized solution is obtained. The number of iterations m plays the role of the regularization parameter.

We are also interested in studying feature selection for vector-valued functions. We implemented the l1l2 regularization, a sparsification method initially proposed by [13], studied in [3] and already applied in [6]. This method iteratively minimizes the following functional, derived from (2) with the square loss and the addition of an l1 penalty term:

\frac{1}{n} \sum_{i=1}^{n} \|y_i - f(x_i)\|_{ℝ^d}^2 + λ(1-α) \|f\|_{ℓ_2}^2 + λα \|f\|_{ℓ_1}.

3.2 Designing the feature map

For each person i = 1, ..., 84, we consider only the 5 measures y_{ik} at positions t_k ∈ {-8, -4, 0, 4, 8} cm. The measures can be thought of as the components of a five-dimensional vector and lie approximately on a parabola, hence we can model them as y_{ik} = f(x_i)_k + ε_{ik}, where x_i stands for the biometric data of volunteer i, ε_{ik} represents the noise and:

f(x_i)_k = c_0(x_i) + c_1(x_i) t_k + c_2(x_i) t_k^2 + b(t_k),    (3)

where b(t) represents a measurement offset independent of the volunteer. To avoid computing b(t_k), we choose to set the mean of the values y_{ik} to zero for each k.
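
As a quick numerical check of model (3) and of the centering step (a sketch with made-up coefficients, offsets and noise levels, not the real measurements), the snippet below generates five measurements per volunteer lying on a parabola in t_k plus a shared offset b(t_k), and verifies that subtracting the per-position mean removes the offset.

```python
import numpy as np

# Toy check of the model in Eq. (3): for every volunteer the five measurements lie on
# a parabola in t_k plus an offset b(t_k) shared by all volunteers. Coefficients,
# offsets and noise level below are made up for illustration only.
rng = np.random.default_rng(1)
t = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])            # measurement positions (cm)
n = 84

C = rng.normal(size=(n, 3))                          # per-volunteer (c0, c1, c2)
b = rng.normal(size=5)                               # position-dependent offset b(t_k)
noise = 0.05 * rng.normal(size=(n, 5))

T = np.stack([np.ones_like(t), t, t**2], axis=1)     # 5 x 3 design in t
Y = C @ T.T + b + noise                              # y_ik = c0 + c1*t_k + c2*t_k^2 + b(t_k) + eps

# Subtracting the per-position mean removes b(t_k) (together with the average
# parabola contribution), so the offsets never need to be estimated explicitly.
Y_centered = Y - Y.mean(axis=0)
print(np.round(Y_centered.mean(axis=0), 6))          # ~0 at every position
```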

In our model, we assume that the coefficients c_j depend linearly on x: c_j(x) = β_j · x, j = 0, 1, 2, and introduce the vector-valued feature map ϕ : X → ℝ^{5 × 3p} (with X ⊆ ℝ^p, p = 22):

ϕ(x) = \begin{pmatrix} x & x t_1 & x t_1^2 \\ x & x t_2 & x t_2^2 \\ x & x t_3 & x t_3^2 \\ x & x t_4 & x t_4^2 \\ x & x t_5 & x t_5^2 \end{pmatrix}    (4)

Let us define β as the vector obtained by concatenating the coefficient vectors β_j; the element β_{lj} is the l-th component of the coefficient vector β_j. The vector-valued estimator can then be rewritten as a linear combination of these new features:

f(x) = ϕ(x) β,  β ∈ ℝ^{3p}.    (5)

We decided to use the quadratic loss function V, therefore the empirical risk is:

E_n(β) = \frac{1}{n} \sum_{i=1}^{n} \|y_i - ϕ(x_i) β\|_{ℝ^5}^2.

Our aim is to compare the performance of the MID parametric model with that of the vector-valued model estimated via the Landweber, ν-method and l1l2 algorithms. These methods require the computation of the gradient of the empirical risk, which, for this specific case, is:

∇E_n(β) = -\frac{2}{n} \left( ϕy - ϕ^Tϕ \, β \right)    (6)

with

(ϕy)_γ = \sum_{i=1}^{n} ⟨ϕ_γ(x_i), y_i⟩_{ℝ^5},   (ϕ^Tϕ)_{γ,γ'} = \sum_{i=1}^{n} ⟨ϕ_γ(x_i), ϕ_{γ'}(x_i)⟩_{ℝ^5},

where ϕ_γ(x) corresponds to the γ-th column of ϕ(x), ϕy ∈ ℝ^{3p} and ϕ^Tϕ ∈ ℝ^{3p × 3p}. The simplest iterative method is the Landweber approach [1]. It starts from the null solution (i.e., all the coefficients β_{lj} equal to zero), which is updated by adding the negative of the gradient multiplied by a constant step size η:

β^{m+1} = β^m - η \, ∇E_n(β^m),  β^0 = (0, ..., 0).

The number of iterations m corresponds to the inverse of the regularization parameter λ.
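
The whole construction can be sketched in a few lines. The snippet below (with randomly generated placeholder data in place of the real biometric features and measurements, and an illustrative number of iterations rather than the selected one) builds the feature map (4), precomputes the quantities ϕy and ϕ^Tϕ appearing in (6), and runs the Landweber iteration with early stopping.

```python
import numpy as np

# Sketch of the vector-valued model (4)-(6) trained with Landweber iterations.
# The biometric data X and the measurements Y are random placeholders
# (n = 84 volunteers, p = 22 features); the number of iterations is illustrative.
rng = np.random.default_rng(2)
n, p = 84, 22
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])            # rescaled measurement positions

X = rng.uniform(-1, 1, size=(n, p))
Y = rng.normal(size=(n, 5))
Y = Y - Y.mean(axis=0)                               # per-position centering

def feature_map(x):
    """phi(x): 5 x 3p matrix whose k-th row is [x, x*t_k, x*t_k^2], as in Eq. (4)."""
    return np.stack([np.concatenate([x, x * tk, x * tk**2]) for tk in t])

Phi = np.stack([feature_map(x) for x in X])          # n x 5 x 3p
phi_y = np.einsum("nkg,nk->g", Phi, Y)               # (phi y)_gamma
phi_t_phi = np.einsum("nkg,nkh->gh", Phi, Phi)       # (phi^T phi)_{gamma, gamma'}

eta = 1.0 / (2 * np.linalg.norm(phi_t_phi, 2))       # step size (2 ||phi^T phi||)^-1
beta = np.zeros(3 * p)
for m in range(300):                                 # early stopping: m plays the role of 1/lambda
    grad = -(2.0 / n) * (phi_y - phi_t_phi @ beta)   # gradient (6)
    beta = beta - eta * grad

Y_hat = np.einsum("nkg,g->nk", Phi, beta)            # f(x_i) = phi(x_i) beta
print("training RMSE:", float(np.sqrt(np.mean((Y - Y_hat) ** 2))))
```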

The ν-method [7] extends Landweber by using a dynamic step size and by introducing an inertial term which keeps memory of the previous update:

β^{m+1} = β^m + u (β^m - β^{m-1}) - w η \, ∇E_n(β^m),

where w and u change at each iteration. It has been shown that this algorithm performs better and faster than Landweber: in fact, the number of iterations corresponds to λ^{-1/2}. l1l2 regularization iteratively minimizes the following functional:

\frac{1}{n} \sum_{i=1}^{n} \|y_i - ϕ(x_i) β\|_{ℝ^5}^2 + λ(1-α) \|β\|_{ℓ_2}^2 + λα \|β\|_{ℓ_1}.

The l1 penalty term forces many of the coefficients β_{lj} to be zero, and the corresponding variables can be considered irrelevant to the problem and discarded. The iterations are essentially of the Landweber type, but at each step the coefficients are soft-thresholded and shrunk:

β^{m+1} = H(β^m - η \, ∇E_n(β^m), τ) / (1 + μ),   τ = λα,   μ = λ(1-α),

where H is the soft-thresholding operator, which sets to zero all coefficients within [-τ, τ] and shifts the remaining coefficients towards zero by τ. The algorithm stops according to a convergence criterion; for details see [3].
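
A corresponding sketch of the l1l2 iteration follows: a Landweber step, then the soft-thresholding and shrinkage of the update above. The data, the value of λ and the stopping tolerance are illustrative placeholders; α = 0.9 mirrors the value used later in Sec. 4.

```python
import numpy as np

# Sketch of the l1l2 iteration: a Landweber step followed by soft-thresholding and
# shrinkage, as in the update above. Data, lambda and the stopping tolerance are
# illustrative placeholders; alpha = 0.9 mirrors the choice reported in Sec. 4.
rng = np.random.default_rng(3)
n, p = 84, 22
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X = rng.uniform(-1, 1, size=(n, p))
Y = rng.normal(size=(n, 5))
Y = Y - Y.mean(axis=0)

Phi = np.stack([np.stack([np.concatenate([x, x * tk, x * tk**2]) for tk in t]) for x in X])
phi_y = np.einsum("nkg,nk->g", Phi, Y)
phi_t_phi = np.einsum("nkg,nkh->gh", Phi, Phi)
eta = 1.0 / (2 * np.linalg.norm(phi_t_phi, 2))

def soft_threshold(v, tau):
    """H(v, tau): zero inside [-tau, tau], shift the remaining entries towards zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

lam, alpha = 0.1, 0.9
tau, mu = lam * alpha, lam * (1 - alpha)
beta = np.zeros(3 * p)
for _ in range(1000):                                    # in practice: stop at convergence [3]
    grad = -(2.0 / n) * (phi_y - phi_t_phi @ beta)
    beta_new = soft_threshold(beta - eta * grad, tau) / (1 + mu)
    converged = np.max(np.abs(beta_new - beta)) < 1e-6
    beta = beta_new
    if converged:
        break

print("non-zero coefficients:", int(np.count_nonzero(beta)), "of", 3 * p)
```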

From the vector-valued feature map ϕ we can compute the corresponding matrix-valued kernel. Following [2]:

(K(x, s))_{pq} = \sum_{γ=1}^{3p} ϕ_{pγ}(x) \, ϕ_{qγ}(s) = (x · s)(1 + t_p t_q + t_p^2 t_q^2).

Note: we can recast the vector-valued model into a scalar one by considering, for each volunteer, five input points (x, t_k), one for each measurement position, and using the factorized scalar kernel:

K((x, t_p), (s, t_q)) = (x · s)(1 + t_p t_q + t_p^2 t_q^2).

It is possible to extend our approach to the non-linear case by replacing the dot product with a suitable scalar kernel function. The estimator can then be written as:

f(x, t) = \sum_{i=1}^{n} \sum_{k=1}^{5} K((x, t), (x_i, t_k)) \, c_{ik}.

With this approach one can use standard scalar regularized regression techniques, but there are some considerations to be made. The first regards the i.i.d. hypothesis on the examples: reformulating the problem as scalar regression, each volunteer is associated with 5 vectors composed of two parts, the first being the biometric data of the volunteer, x_i, and the second the measurement position, t_k. Consequently, the training set has 5n elements, whose biometric and position components are not i.i.d. Furthermore, enforcing sparsity on the coefficients c_{ik} is very different from sparsifying the coefficients β_{lj}, which are directly related to the biometric features.

3.3 A naïve scalar approach

For comparison, we tested our model against a naïve approach, which consists in treating the measures at each position as independent scalar regression problems. Five scalar models are therefore trained separately and their outputs are combined to recover the background signal. The prior knowledge that the magnetic signal of each subject is roughly a parabola is no longer taken into account. We implemented standard RLS regression with polynomial and Gaussian kernels, and l1l2 regularization.

3.4 Model selection and assessment

We adopt an experimental protocol in order to select the model parameters and assess the generalization capabilities of our method in an unbiased way. We perform two nested loops of K-fold Cross Validation.

We recall that the estimate of the generalization error is the mean of the empirical errors on the K test sets. If K equals the total number of available data, the method is called Leave One Out Cross Validation (LOO). Higher values of K reduce the bias of the estimator, since the model is trained on more data, but increase its variance, since fewer data are used for testing. On the other hand, more splits imply more computations, hence more time. In some cases, for example for RLS [11], closed-form solutions of the Leave One Out error have been obtained, resulting in very fast computation. For the vector-valued model, the inner loop is a 5-fold Cross Validation and is performed to select the regularization parameter (e.g. λ or the number of iterations m). For each value of the parameter, an estimate of the generalization error is computed; the value that minimizes the error is used for training. The outer loop is a Leave One Out Cross Validation evaluating the performance of the chosen model; the estimate of the generalization error is the mean of the K = n empirical errors. For the RLS scalar models, the inner loop is a LOO CV for selecting both the kernel parameter and the regularization parameter λ, exploiting the computational advantage of the closed-form solution for the LOO error for the latter. The selection of λ for the scalar l1l2 regularization method was performed by a 5-fold CV. The evaluation of the performance of these models was carried out through LOO CV, consistently with the procedure adopted for the vector-valued model.

4 Results

The data set is composed of 84 healthy volunteers, represented by the vectors of features reported in Table 1. Note that from now on we will refer to n as the number of examples in the training set within the innermost loop of CV. The features are highly inhomogeneous and can lead to numerical problems, therefore we decided to normalize our data. We set the columns of the n × p data matrix X = (x_1, ..., x_n)^T to have zero mean and fixed range, and changed the variable t from {-8, -4, 0, 4, 8} to {-1, -0.5, 0, 0.5, 1}, since it only represents a label for the components of the vector y. Thus, each element of the three-dimensional array ϕ(X) ∈ ℝ^{n × 3p × 5}, obtained by applying the feature map to the data matrix X, belongs to [-1, 1]. In the test phase, we apply the same normalizing factors to the test data. For model selection and assessment we used the experimental protocol outlined in subsec. 3.4: the model parameters to be selected are the number of iterations m for the Landweber and ν-method algorithms and the regularization parameter λ for the l1l2 method.
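
The nested protocol can be sketched as follows. This is a toy implementation with reduced sizes, random placeholder data and an arbitrary grid of candidate iteration numbers: the outer leave-one-out loop assesses the model, while the inner 5-fold loop selects the number of Landweber iterations.

```python
import numpy as np

# Toy sketch of the nested validation protocol: the outer leave-one-out loop assesses
# the model, the inner 5-fold loop selects the number of Landweber iterations.
# Sizes, data and the candidate grid are placeholders chosen to keep the example fast.
rng = np.random.default_rng(5)
n, p = 30, 6
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X = rng.uniform(-1, 1, size=(n, p))
Y = rng.normal(size=(n, 5))
Y = Y - Y.mean(axis=0)

def features(X_):
    return np.stack([np.stack([np.concatenate([x, x * tk, x * tk**2]) for tk in t]) for x in X_])

def fit(X_tr, Y_tr, n_iter):
    Phi = features(X_tr)
    phi_y = np.einsum("nkg,nk->g", Phi, Y_tr)
    phi_t_phi = np.einsum("nkg,nkh->gh", Phi, Phi)
    eta = 1.0 / (2 * np.linalg.norm(phi_t_phi, 2))
    beta = np.zeros(3 * p)
    for _ in range(n_iter):
        beta -= eta * (-(2.0 / len(X_tr)) * (phi_y - phi_t_phi @ beta))
    return beta

def predict(X_te, beta):
    return np.einsum("nkg,g->nk", features(X_te), beta)

grid = [10, 50, 100, 300]                       # candidate numbers of iterations
loo_errors = []
for i in range(n):                              # outer LOO loop (assessment)
    tr = np.delete(np.arange(n), i)
    folds = np.array_split(rng.permutation(tr), 5)
    cv_err = []
    for m in grid:                              # inner 5-fold loop (model selection)
        errs = []
        for f in folds:
            inner_tr = np.setdiff1d(tr, f)
            errs.append(np.mean((Y[f] - predict(X[f], fit(X[inner_tr], Y[inner_tr], m))) ** 2))
        cv_err.append(np.mean(errs))
    best_m = grid[int(np.argmin(cv_err))]       # parameter minimizing the inner CV error
    beta = fit(X[tr], Y[tr], best_m)
    loo_errors.append(np.mean(np.abs(Y[i] - predict(X[i:i + 1], beta)[0])))

print("LOO estimate of the error:", float(np.mean(loo_errors)))
```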

Table 1: Volunteer's features

Feature   Description
1         Eddy current at -12 cm
2         Eddy current at -8 cm
3         Eddy current at -4 cm
4         Eddy current at 0 cm
5         Eddy current at 4 cm
6         Eddy current at 8 cm
7         Eddy current at 12 cm
8         Thorax section area at 0 cm
9         Thorax section area at 18 cm
10        Thorax section area at -18 cm
11        Thorax height at 0 cm
12        Thorax height at 18 cm
13        Thorax height at -18 cm
14        Adam's apple position
15        Navel position
16        Age
17        Height
18        Weight
19        Thorax circumference
20        Circumference under ribs arch
21        BMI (Body Mass Index)
22        Body area
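
The normalization of the features of Table 1 described above can be sketched as follows; X is a random placeholder for the 84 × 22 feature matrix, and the half-range scaling is an illustrative choice.

```python
import numpy as np

# Sketch of the normalization described above: each column of the data matrix is
# shifted to zero mean and scaled to a fixed range, and the position labels are
# rescaled from centimetres to {-1, -0.5, 0, 0.5, 1}. X is a random placeholder
# for the 84 x 22 feature matrix of Table 1.
rng = np.random.default_rng(6)
X = rng.uniform(0, 100, size=(84, 22))

mean = X.mean(axis=0)
half_range = (X.max(axis=0) - X.min(axis=0)) / 2
X_norm = (X - mean) / half_range          # zero mean, range 2 for every column

t = np.array([-8, -4, 0, 4, 8]) / 8.0     # rescaled position labels

# In the test phase the same `mean` and `half_range` (computed on training data)
# would be applied to the test examples.
print(X_norm.mean(axis=0).round(3))                        # ~0 for every feature
print((X_norm.max(axis=0) - X_norm.min(axis=0)).round(3))  # 2 for every feature
print(t)
```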

In the latter case, α was set to 0.9 to enforce maximum sparsity while retaining correlated features [3]. For scalar RLS regression, we also selected the kernel parameters (the degree of the polynomial or the σ of the Gaussian) alongside the regularization parameter λ. The implemented iterative algorithms require the specification of a step size η; we chose the value η = (2 \|ϕ^Tϕ\|)^{-1}, which guarantees their convergence, see [12; 3; 7]. We report the selected parameters in Table 2 for the vector-valued model and in Table 3 for the naïve approach. Note that the values correspond to the median of the parameters selected for each model during the outer loop of the LOO cross validation.

Table 2: Selected parameters

Model        Number of iterations    λ
Landweber    397                     n.a.
ν-method     68                      n.a.
l1l2

Table 3: Selected parameters - Naïve approach

            RLS gaussian      RLS polynomial    l1l2
Position    λ       Par       λ       Par       λ       Par
x1
x2
x3
x4
x5

Figure 2 shows the boxplots of the LOO error distributions for all the models tested, compared with the model in use at the E.O. Ospedali Galliera Hospital, assessed with the same validation protocol. As expected, we observe that the LOO errors show a high variance. Table 4 summarizes the statistics of each distribution. The l1l2 algorithm performs slightly better and seems more robust to outliers, both for the vector-valued model and for the naïve approach.

Table 4: Summary statistics of the LOO error distributions

Model             I quartile    Median    III quartile
Landweber
ν-method
l1l2
MID
RLS gauss
RLS polynomial
L1L2 scalar

The accuracies obtained with these methods correspond to a precision in the iron overload estimation of about 0.8 g. An iron overload lower than 1 g is considered mild: currently no model is capable of detecting this kind of iron burden.

5 Conclusions

The proposed model is a general method for approaching vector-valued regression problems. Moreover, it can also be used to estimate a curve parametrized by a variable that is always sampled at fixed values. Prior knowledge (e.g. the shape of the curve with respect to the parametrizing variable, or the correlation among the elements of the vector-valued function to be estimated) can easily be incorporated by explicitly writing the feature map or the kernel function. Our results show that the iterative algorithms can be successfully applied to the vector-valued case. They also provide an efficient alternative to the direct computation of the inverse of an nd × nd matrix. The model selection and validation protocol adopted leads to an unbiased solution, avoiding overfitting and unreliable estimates of the performance. The Marinelli group will soon start a new data acquisition campaign on volunteers: the old features will be measured more accurately and some new ones will be introduced, for example a 3D laser scan of the volunteer's thorax. The statistical methods presented here will be applied to the new dataset and compared against a neural network system that the Marinelli group is planning to develop.

Figure 2: LOO error distributions (absolute value of the residue for each model). The first three models are obtained from the vector-valued one by the indicated algorithms. The MID model is the one currently used for diagnosis. The last three boxplots regard the naïve scalar approach.

References

[1] P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98.

[2] A. Caponnetto, C. Micchelli, M. Pontil, and Y. Ying. Universal kernels for multi-task learning. Journal of Machine Learning Research, submitted.

[3] C. De Mol, E. De Vito, and L. Rosasco. Sparse Tikhonov regularization for variable selection and learning. Technical report, DISI.

[4] C. De Mol, S. Mosci, M. Traskine, and A. Verri. A regularized method for selecting nested groups of relevant genes from microarray data. Technical report, DISI.

[5] E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5.

[6] A. Destrero, S. Mosci, C. De Mol, A. Verri, and F. Odone. Feature selection for high dimensional data. Computational Management Science, to appear.

[7] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, to appear.

[8] M. Marinelli, S. Cuneo, B. Gianesin, A. Lavagetto, M. Lamagna, E. Oliveri, G. Sobrero, L. Terenzani, and G. Forni. Non-invasive measurement of iron overload in the human body. IEEE Transactions on Applied Superconductivity, 16(2), June.

[9] M. Marinelli, B. Gianesin, M. Lamagna, A. Lavagetto, E. Oliveri, M. Saccone, G. Sobrero, L. Terenzani, and G. Forni. Whole liver iron overload measurement by a non-cryogenic magnetic susceptometer. In Proc. of New Frontiers in Biomagnetism, Vancouver, Canada.

[10] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17.

[11] R. M. Rifkin and R. A. Lippert. Notes on regularized least squares. Technical report, MIT DSpace (United States).

[12] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2), August.

[13] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2).
