Mining in Hepatitis Data by LISp-Miner and SumatraTT

Petr Aubrecht 1, Martin Kejkula 2, Petr Křemen 1, Lenka Nováková 1, Jan Rauch 2, Milan Šimůnek 2, Olga Štěpánková 1, and Monika Žáková 1

1 Czech Technical University in Prague, FEE, Prague 6, Czech Republic, {aubrech,kremep1,novakova,step}@fel.cvut.cz
2 University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic, {kejkula,rauch,simunek}@vse.cz

Abstract. The paper suggests a methodology for searching for temporal patterns and tests it on the problem of the difference between hepatitis B and C. To reach this goal, two software systems, LISp-Miner and SumatraTT, are combined: sophisticated data transformations and enhancements are designed and carried out in SumatraTT, while LISp-Miner takes care of the search for significant and interesting differences in the resulting datasets. The main results are reviewed in section 4: several examinations are identified whose values differ significantly between the two types of hepatitis. This indicates that the suggested general methodology has promising potential when applied to the considered type of data. A plan for additional data-mining questions to be studied later is presented in the Conclusions.

1 Introduction

This paper presents results of mining in the hepatitis data set that was offered as a part of the Discovery Challenge at PKDD 2005. We try to contribute to discovering differences in temporal patterns between hepatitis B and C. A similar question has been analyzed in [2]; while its authors use temporal abstraction, we introduce trend characteristics (see section 2.2), which are calculated during data preprocessing carried out in SumatraTT, and we apply LISp-Miner to find relevant association rules (see section 3).

SumatraTT [8] (homepage: http://krizik.felk.cvut.cz/sumatra) is a modular system for data preprocessing and data transformations. It offers a number of easy-to-use reusable modules for loading or exporting data from/to different formats, for analysis of individual attributes (elementary statistics, contingency tables, etc.) and for definition of additional derived attributes through sophisticated processing (scripting, integration with SQL databases, etc.). The last property is one of the main advantages of SumatraTT for the considered data-mining task: to characterize the temporal patterns in the measured attributes of patients, it proved necessary to introduce a number of new derived attributes, which were obtained through transformations and aggregation carried out in SumatraTT. The individual modules can be combined into a project through an intuitive graphical interface, which automatically creates detailed documentation as a by-product. In this way, the SumatraTT project becomes an efficient communication platform for our team during the work on any data-mining task.

LISp-Miner is an academic software system intended to support teaching and research. It consists of six data mining procedures, the machine learning procedure KEX and two procedures for data transformation [5, 7]; see also http://lispminer.vse.cz. All six data mining procedures of the LISp-Miner system are GUHA procedures in the sense of [1]. The input of a GUHA procedure consists of the analyzed data and of a simple definition of relevant (i.e. potentially interesting) patterns. The GUHA procedure automatically generates each particular pattern and tests whether it is true in the analyzed data. The output of the procedure consists of all prime patterns. A pattern is prime if it is true in the analyzed data and if it does not immediately follow from other, simpler output patterns [1]. Each GUHA procedure of the LISp-Miner system mines for a particular type of pattern. Its most frequently used procedure is 4ft-Miner, which mines for enhanced association rules [5]. In this paper, we use the procedure SD4ft-Miner [6].

The paper is organized as follows. Applications of SumatraTT to derive several data matrices suitable for further analysis are described in section 2. The procedure SD4ft-Miner mines for SD4ft-patterns, which are introduced in section 3; results of several applications of SD4ft-Miner are given in section 4. Some concluding remarks are in section 5.

2 Transforming Data by SumatraTT

2.1 Data Understanding: Review of Important Properties

The hepatitis source data is in the form of well-documented CSV files. The data was first loaded from the text files (ilab e030704.csv etc.) into an SQL database. We then prepared several steps of data preprocessing. First, the data from both tables ilab and olab (internal and external exams) was merged and further considered together. The merged table contains 1060 different types of exams (mainly due to disparate exam types in olab, which has 845 different exams). To reduce this excessive number, we first decided to omit rare exam types and take into account only exam types with more than 14000 occurrences; there are 41 exams of this sort. Moreover, 81 important exam types were considered upon a specialist's recommendation. In this way, we ended up with 105 exam types.

During the data preprocessing we identified some important properties of the considered dataset which were not explicitly mentioned in the earlier articles dedicated to this dataset. After reviewing the summary numbers in the table below, we decided to study the group with exactly one biopsy and without interferon therapy.

Patients  Description
771       all
503       with a biopsy
99        > 1 biopsy (all have type C)
123       = 1 biopsy and interferon
74        > 1 biopsy and interferon
306       ≥ 1 biopsy, no interferon
281       = 1 biopsy, no interferon
1         with interferon, no biopsy
460       ≥ 1 exam before the first biopsy
459       ≥ 1 exam in hemat table before the first biopsy
3         ≥ 1 biopsy, no other exam (mid=808, 500, 202)

2.2 Temporal Characteristics of the Considered Attributes

The patients are not examined regularly: the period between two examinations can range from one day to several months. Periods when a patient is observed frequently alternate with quieter periods. The highest number of all exams for one patient is 12659 (patient # 321). This irregularity has to be taken into account when choosing the characteristics that describe the temporal properties of the measured values.

In order to standardize the information provided about individual patients, we decided to concentrate on data collected during a specific, well-defined time interval of a fixed length τ for each individual patient. The considered interval does not start on a single date for all the patients. On the contrary, it is tightly bound to the state of the individual patient: the considered interval ends (or begins) at a significant instant which can be easily recognized in the available data of the patient, e.g. the time of his/her first biopsy or the time when a specific treatment (e.g. interferon) was introduced. The length of the time interval is constant for all patients and is understood to be a parameter τ of the considered project.

The number of measurements for one patient during one year usually ranges from 2 to 10. To make up for this non-uniformity, we decided to use the following trend characteristics of the considered sequence of time-stamped data: average, number of measurements, gradient (resulting from linear approximation), maximum, minimum, and variance. For the purposes of further data mining, the results were saved as a matrix with patients in rows and trend characteristics in columns.

The rest of the paper tries to show that this type of derived attribute can capture interesting dependencies in the considered data. For that purpose we have to fix the significant instant and τ. In the rest of the paper, the significant instant is set to the time of the first biopsy. Moreover, we do not include patients treated by interferon in the studied dataset. Only those patients with measurements during the τ period (see below) before the first biopsy were selected. In the next step, all the exams were filtered according to the following requirements: the exam must provide a numeric value (omitting values like +, > 3, etc.) and the type of exam must be measured at least 10 times for each considered patient. The size of the resulting set is mentioned in section 2.3. Data selected in the previous steps was then analyzed as sequences and the above-mentioned trend characteristics were calculated. Finally, data about patients was added (sex, age, type of hepatitis, maximum fibrosis and activity).
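The trend characteristics listed above can be illustrated by a minimal Python sketch. It is not the SumatraTT implementation: the function name trend_characteristics, the use of days as the time unit, an ordinary least-squares line for the gradient, and the population variance are all assumptions made for illustration only.

```python
from datetime import date, timedelta
from statistics import mean, pvariance


def trend_characteristics(measurements, instant, tau_days):
    """Summarize (date, value) pairs measured in the tau_days before `instant`.

    Hypothetical helper: time is counted in days, the gradient is the slope
    of a least-squares line, and the variance is the population variance.
    """
    start = instant - timedelta(days=tau_days)
    window = sorted((d, v) for d, v in measurements if start <= d <= instant)
    if not window:
        return None  # no measurement inside the considered interval

    xs = [(d - start).days for d, _ in window]
    ys = [v for _, v in window]
    x_mean, y_mean = mean(xs), mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    return {
        "avg": y_mean,
        "count": len(ys),
        # The slope is undefined when all measurements share one date.
        "grad": sxy / sxx if sxx > 0 else None,
        "max": max(ys),
        "min": min(ys),
        "var": pvariance(ys),
    }
```

One such dictionary per patient and exam type would give one group of columns (e.g. an average and a gradient column) in the patient-by-characteristic matrix described above.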

The data preprocessing resulted in a data matrix whose rows correspond to particular patients identified by MID. Columns of this data matrix contain various trend characteristics of the considered examinations for the corresponding patients (e.g. ALB avg is the average of the values of ALB exam results).

2.3 Enhanced Datasets

The following data matrices describing the behavior of various characteristics before the first biopsy were prepared for data mining: TRENDS BIO 24 relates to patients who have a history of exams at least τ = 24 months long, TRENDS BIO 12 to patients with an exam history of τ = 12 months (investigated in detail later in the article), and TRENDS BIO 3 to patients with an exam history of τ = 3 months. The size of the resulting datasets is increasing (53, 85, and 171). For pilot experiments, the dataset corresponding to 12 months was chosen.

3 SD4ft-patterns

We use the data matrix TRENDS BIO 12 shown in Fig. 1 to introduce the SD4ft-patterns.

row     Basic attributes                          In-hospital examinations
number  MID  Sex  Age  Type  Fibrosis  Activity   CL avg   CL grad   ...
1       1    M    29   B     2         2          X        X
2       42   M    33   C     1         1          103.12   -5.5E-7
...
60      947  M    58   C     1         FALSE      X        X

Fig. 1. Data matrix TRENDS BIO 12

Each row of TRENDS BIO 12 corresponds to one patient identified by the value of column MID. Values of the column Sex come from the table pt e030704.csv. Column Age contains the age of the patient at the time of the first biopsy (bio e030704.csv and pt e030704.csv are used). Columns Type (i.e. hepatitis type), Fibrosis and Activity come from bio e030704.csv and they indicate values at the time of the first biopsy. The value X in the column CL avg in row 1 means that the value of CL (i.e. chloride, see ilab e030704.csv) was not measured for the patient with MID = 1. The value -5.5E-7 in the column CL grad in row 2 is the gradient of the linear approximation of the time series of the examinations of CL taken during the 12 months before the first biopsy for the patient with MID = 42. The same holds analogously for the other patients and columns. The data matrix TRENDS BIO 12 has 224 columns with gradient, average, etc. values of specific examinations.

The procedure SD4ft-Miner mines for SD4ft-patterns of the form α ⋈ β : ϕ ≈ ψ / γ.

M/(α ∧ γ)   ψ          ¬ψ
ϕ           a_{α∧γ}    b_{α∧γ}
¬ϕ          c_{α∧γ}    d_{α∧γ}
4ft(ϕ, ψ, M/(α ∧ γ))

M/(β ∧ γ)   ψ          ¬ψ
ϕ           a_{β∧γ}    b_{β∧γ}
¬ϕ          c_{β∧γ}    d_{β∧γ}
4ft(ϕ, ψ, M/(β ∧ γ))

Fig. 2. 4ft-tables 4ft(ϕ, ψ, M/(α ∧ γ)) and 4ft(ϕ, ψ, M/(β ∧ γ))

Here α, β, γ, ϕ, and ψ are Boolean attributes defined from the columns of the analyzed data matrix M. The SD4ft-pattern α ⋈ β : ϕ ≈ ψ / γ means that the subsets of patients meeting the Boolean conditions α and β differ with respect to the validity of the association rule ϕ ≈ ψ when the condition given by the Boolean attribute γ is satisfied. A measure of difference is defined by the symbol ≈, which is called the SD4ft-quantifier. The association rule ϕ ≈ ψ means here a general relation of Boolean attributes ϕ and ψ in the sense of [5].

An example of an SD4ft-pattern is the pattern

Type(B) ⋈ Type(C) : LDH grad(≥ 0) ⇒^D_{0.4} GOT grad(≥ 0) / Age(30-69)

It means that the patients with hepatitis B differ from the patients with hepatitis C with respect to the relation of the Boolean attributes LDH grad(≥ 0) (i.e. the value of LDH grad is ≥ 0) and GOT grad(≥ 0) when we consider patients aged 30-69 years. The difference is given by the SD4ft-quantifier ⇒^D_{0.4}. We introduce it using the general notation α, β, γ, ϕ, and ψ.

The SD4ft-quantifier concerns two four-fold contingency tables (i.e. 4ft-tables) 4ft(ϕ, ψ, M/(α ∧ γ)) and 4ft(ϕ, ψ, M/(β ∧ γ)), see Fig. 2. The 4ft-table 4ft(ϕ, ψ, M/(α ∧ γ)) is the contingency table of ϕ and ψ on M/(α ∧ γ). The data matrix M/(α ∧ γ) is the submatrix of M that consists of exactly those rows of M that satisfy α ∧ γ. In other words, M/(α ∧ γ) corresponds to all objects (i.e. rows) from the set defined by α that satisfy the condition γ. We write 4ft(ϕ, ψ, M/(α ∧ γ)) = ⟨a_{α∧γ}, b_{α∧γ}, c_{α∧γ}, d_{α∧γ}⟩, where a_{α∧γ} is the number of rows of the data matrix M/(α ∧ γ) satisfying both ϕ and ψ, b_{α∧γ} is the number of rows satisfying ϕ and not ψ, etc. The 4ft-table 4ft(ϕ, ψ, M/(β ∧ γ)) of ϕ and ψ on M/(β ∧ γ) is defined analogously.

The SD4ft-quantifier ⇒^D_{0.4} is defined by the condition

\[
\left| \frac{a_{\alpha\wedge\gamma}}{a_{\alpha\wedge\gamma} + b_{\alpha\wedge\gamma}} - \frac{a_{\beta\wedge\gamma}}{a_{\beta\wedge\gamma} + b_{\beta\wedge\gamma}} \right| \geq 0.4
\]

This condition means that the absolute difference between the confidence of the classical association rule ϕ → ψ on the data matrix M/(α ∧ γ) and the confidence of this association rule on the data matrix M/(β ∧ γ) is at least 0.4. The SD4ft-pattern α ⋈ β : ϕ ⇒^D_{0.4} ψ / γ is true on the data matrix M if this condition is satisfied.

The example SD4ft-pattern is verified using the 4ft-tables T_B and T_C, see Fig. 3.

TRENDS BIO 12 / (Type(B) ∧ Age(30-69))   GOT grad(≥ 0)   ¬GOT grad(≥ 0)
LDH grad(≥ 0)                            11              0
¬LDH grad(≥ 0)                           6               5
T_B = 4ft(LDH grad(≥ 0), GOT grad(≥ 0), TRENDS BIO 12 / (Type(B) ∧ Age(30-69)))

TRENDS BIO 12 / (Type(C) ∧ Age(30-69))   GOT grad(≥ 0)   ¬GOT grad(≥ 0)
LDH grad(≥ 0)                            13              10
¬LDH grad(≥ 0)                           0               4
T_C = 4ft(LDH grad(≥ 0), GOT grad(≥ 0), TRENDS BIO 12 / (Type(C) ∧ Age(30-69)))

Fig. 3. 4ft-tables T_B and T_C

Let us note that the sum of all frequencies from the 4ft-tables T_B and T_C is smaller than 60 because the missing values X are omitted. It is easy to verify that the condition corresponding to the SD4ft-quantifier ⇒^D_{0.4} is satisfied: the confidence is 11/11 = 1.00 for type B and 13/23 ≈ 0.57 for type C. We can conclude that the SD4ft-pattern

Type(B) ⋈ Type(C) : LDH grad(≥ 0) ⇒^D_{0.4} GOT grad(≥ 0) / Age(30-69)

is true on the data matrix TRENDS BIO 12. Very informally speaking, we can interpret this SD4ft-pattern as follows: the confidence of the association rule (non-negative gradient of LDH) → (non-negative gradient of GOT) is at least 0.4 greater for type B than for type C when we consider the patients 30-69 years old.

4 SD4ft-Miner Application Results

We solved three different tasks. In the first task we searched for very simple SD4ft-patterns (without the condition γ)

Type(B) ⋈ Type(C) : TRUE ≈ ψ

where TRUE is a specially prepared basic Boolean attribute that is identically true and ≈ is a suitable SD4ft-quantifier (see below). Note that the confidence of the association rule TRUE → ψ is equal to the relative frequency of rows of the analyzed data matrix satisfying ψ. This means that we can use the SD4ft-quantifier

⇒^D_{0.15} ∧ a_α ≥ 10 ∧ a_β ≥ 10

which says that the difference of relative frequencies is at least 0.15 and that there are at least 10 patients with type B hepatitis satisfying ψ and also at least 10 such patients with type C hepatitis.

We use the set of relevant SD4ft-patterns Type(B) ⋈ Type(C) : TRUE ≈ ψ such that the succedents are 903 intervals of averages of 22 in-hospital examinations, namely CL, D-BIL, F-CHO, FE, G-GL, G-GTP, GOT, GPT, HBE-AB, HBE-AG, CHE, I-BIL, K, LDH, NA, Oudan, T-BIL, T-CHO, TG, TP, U-UBG, UN. The 903 intervals are defined by a few parameters chosen so that the resulting intervals are of reasonable size. These patterns were generated and verified in 1 sec (PC with 3.06 GHz, 512 MB DDR SDRAM). Due to various optimizations, only 308 verifications were actually performed. The result is 18 true SD4ft-patterns concerning 8 attributes. The strongest pattern for each of these attributes is shown in Table 1. Note that there are 27 patients with hepatitis type B and 33 patients with hepatitis type C.

                       frequency type B         frequency type C
literal                relative R_B  absolute   relative R_C  absolute   R_B - R_C
TP avg( 7)             0.48          13         0.88          29         -0.40
CHE avg(100; 400⟩      0.67          18         0.91          30         -0.24
F-CHO avg(45; 65⟩      0.48          13         0.30          10          0.18
CL avg(102; 105⟩       0.41          11         0.58          19         -0.17
I-BIL avg(0.3; 0.6⟩    0.78          21         0.61          20          0.17
UN avg(12; 16⟩         0.59          16         0.42          14          0.17
T-BIL avg(0.6; 0.9⟩    0.55          15         0.39          13          0.16
G-GTP avg(20; 50⟩      0.52          14         0.36          12          0.16

Table 1. Differences of relative frequencies

The difference of relative frequencies can be understood as a difference of confidences of the association rules TRUE → ψ for types B and C. Thus it is reasonable to ask whether there is a difference stronger than 0.4 for confidences of association rules ϕ → ψ where both ϕ and ψ are literals similar to ψ in the previous task. We therefore searched for SD4ft-patterns (without the condition γ) of the form

Type(B) ⋈ Type(C) : ϕ ≈ ψ

where ≈ is the SD4ft-quantifier defined as

⇒^D_{0.4} ∧ a_α ≥ 10 ∧ a_β ≥ 10

This quantifier says, among other things, that the difference of confidences is at least 0.4. We defined a set of more than 815 000 relevant SD4ft-patterns. Due to various optimizations, only 89 254 were generated and verified, in about 2 seconds; see also [5]. There are 27 SD4ft-patterns satisfying the given condition, and all of them have the attribute TP avg in the succedent. We therefore show only the three strongest ones, together with three further ones that do not contain the attribute TP avg and were found by another run of the SD4ft-Miner procedure, see Table 2.
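The check of an SD4ft-quantifier with optional support thresholds can be sketched in a few lines of Python, using the frequencies of the example 4ft-tables T_B and T_C from Fig. 3. The function names are hypothetical and this is not LISp-Miner code; taking the absolute value of the confidence difference is an assumption suggested by the negative differences reported among the results.

```python
def confidence(a, b):
    """Confidence a / (a + b) of a rule, from the first row of a 4ft-table."""
    if a + b == 0:
        raise ValueError("confidence is undefined when a + b = 0")
    return a / (a + b)


def sd4ft_holds(table_alpha, table_beta, min_diff, min_a=0):
    """Check the difference-of-confidences condition on two 4ft-tables.

    Each table is a tuple (a, b, c, d). The absolute difference is an
    assumption of this sketch; min_a models the thresholds a_alpha >= 10
    and a_beta >= 10 used in the first two tasks.
    """
    a1, b1 = table_alpha[0], table_alpha[1]
    a2, b2 = table_beta[0], table_beta[1]
    diff = abs(confidence(a1, b1) - confidence(a2, b2))
    return diff >= min_diff and a1 >= min_a and a2 >= min_a


# Frequencies (a, b, c, d) read from the 4ft-tables of Fig. 3.
T_B = (11, 0, 6, 5)
T_C = (13, 10, 0, 4)

print(confidence(T_B[0], T_B[1]))             # confidence for type B
print(confidence(T_C[0], T_C[1]))             # confidence for type C
print(sd4ft_holds(T_B, T_C, 0.4, min_a=10))   # quantifier of the second task
```

For these two tables the confidences are 11/11 and 13/23, so the difference exceeds 0.4 and both supports are at least 10.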
                                            type B                    type C
antecedent             succedent            Conf_B  a_B  support %    Conf_C  a_C  support %    Conf_B - Conf_C
CHE avg(100; 300⟩      TP avg(5; 7.5⟩       0.93    13   48           0.42    10   30           0.51
CL avg(102; 106⟩       TP avg(6.5; 7.5⟩     1.00    15   56           0.50    11   33           0.50
K avg⟨4; 4.4)          TP avg(6; 7.5⟩       1.00    17   63           0.50    10   30           0.50
(further rules with TP avg in the succedent skipped)
LDH grad⟨0; 0.5)       GPT grad⟨0; 0.5)     1.00    13   48           0.5     12   36           0.50
F-CHO grad⟨0; 0.5)     GOT grad⟨0; 0.5)     0.94    16   59           0.44    12   36           0.50
CL avg(103; 107⟩       I-BIL avg⟨0.3; 0.6)  0.88    14   52           0.53    10   30           0.35

Table 2. Differences concerning pairs of examinations

We also tried to find conditions under which the difference of confidences is even stronger. We searched for SD4ft-patterns of the form

α ⋈ β : ϕ ≈ ψ / γ

where the condition γ was created from Sex, Age, Fibrosis and Activity. About 6.2 · 10^6 relevant patterns were verified. The 76 true SD4ft-patterns with a condition were found in 2 minutes and 23 seconds. Some examples of the strongest and most interesting ones are in Table 3.

                                            type B                    type C
antecedent             succedent            Conf_B  a_B  support %    Conf_C  a_C  support %    Conf_B - Conf_C
Condition: Age 40; Type B: 8 patients; Type C: 26 patients
CHE avg(0; 300⟩        TP avg(5.0; 7.5⟩     1.00    7    88           0.38    8    31           0.62
Condition: Age 35; Type B: 15 patients; Type C: 28 patients
K avg⟨4; 4.4)          TP avg(6.0; 7.5⟩     1.00    9    60           0.41    7    25           0.59
Oudan avg(4; 6⟩        T-BIL avg(0.5; 0.8⟩  1.00    8    53           0.41    7    25           0.59
Oudan avg(4; 6⟩        TP avg(0.5; 7.5⟩     1.00    8    53           0.41    7    25           0.59
Condition: Fibrosis(1,2); Type B: 18 patients; Type C: 21 patients
LDH grad⟨0; 0.5)       GPT grad⟨0; 0.5)     1.00    11   61           0.5     8    38           0.50
F-CHO grad⟨0; 0.5)     GPT grad⟨0; 0.5)     0.92    11   61           0.47    8    36           0.45
D-BIL avg⟨0.1; 0.3)    TP avg(5.0; 7.5⟩     1.00    10   56           0.50    7    33           0.50

Table 3. Differences concerning pairs of examinations under conditions
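The 4ft-tables behind such results are simple counts over the rows of the data matrix. The following Python sketch shows one way to compute 4ft(ϕ, ψ, M/(α ∧ γ)) while skipping rows with missing values; it is an illustration under assumptions, not the SD4ft-Miner algorithm. The helper names, the column names LDH_grad and GOT_grad, and the toy rows are hypothetical and are not taken from TRENDS BIO 12.

```python
def four_fold_table(rows, phi, psi, condition=lambda row: True):
    """Return (a, b, c, d) for phi and psi on the rows satisfying `condition`.

    phi and psi map a row to True, False, or None (None = missing value);
    rows with a missing value are skipped.
    """
    a = b = c = d = 0
    for row in rows:
        if not condition(row):
            continue
        p, q = phi(row), psi(row)
        if p is None or q is None:
            continue
        if p and q:
            a += 1
        elif p:
            b += 1
        elif q:
            c += 1
        else:
            d += 1
    return a, b, c, d


def non_negative(column):
    """Build the literal 'column >= 0'; it yields None for a missing value."""
    def literal(row):
        value = row.get(column)
        return None if value is None else value >= 0
    return literal


def subset(hepatitis_type, age_low=30, age_high=69):
    """Build the condition 'Type = hepatitis_type and age within the range'."""
    def condition(row):
        return row["Type"] == hepatitis_type and age_low <= row["Age"] <= age_high
    return condition


# Hypothetical toy rows for illustration only.
TOY_ROWS = [
    {"Type": "B", "Age": 45, "LDH_grad": 0.1, "GOT_grad": 0.2},
    {"Type": "B", "Age": 50, "LDH_grad": -0.3, "GOT_grad": 0.0},
    {"Type": "B", "Age": 25, "LDH_grad": 0.4, "GOT_grad": 0.1},
    {"Type": "C", "Age": 60, "LDH_grad": 0.0, "GOT_grad": -0.1},
    {"Type": "C", "Age": 40, "LDH_grad": None, "GOT_grad": 0.3},
    {"Type": "C", "Age": 35, "LDH_grad": -0.2, "GOT_grad": -0.5},
]

phi = non_negative("LDH_grad")
psi = non_negative("GOT_grad")
table_b = four_fold_table(TOY_ROWS, phi, psi, subset("B"))
table_c = four_fold_table(TOY_ROWS, phi, psi, subset("C"))
```

Passing predicates rather than column names keeps the sketch close to the notation of Boolean attributes α, γ, ϕ and ψ: any condition built from Sex, Age, Fibrosis or Activity could be supplied in the same way.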
5 Conclusions and Further Work

We have succeeded in finding several patterns that indicate the existence of differences in trend characteristics between hepatitis type B and type C. The process is far from straightforward. First, it was necessary to transform the original data into a suitable data matrix using SumatraTT, and then the SD4ft-Miner procedure was applied several times. Some strong rules seem to appear, but interpretation of the results given in Tables 1, 2, and 3 is impossible without relevant medical knowledge. It will be very interesting to compare our results with those in [2]: several attributes have been identified as important by both approaches, namely T-BIL, CHE, GOT, GPT and TP. However, the considered set of 60 patients is too small: the restrictions applied in creating the considered data matrix do not take optimal advantage of all the available data.

All the steps of our approach are easy to repeat and modify, and there are many possibilities for doing so. Based on the experience with the present data and the results of current data mining efforts, we are planning to modify the selection criteria used in preprocessing. We believe that the suggested methodology, based on the selection of a time window related to some significant instant, could prove useful when studying the influence of interferon therapy. Further analysis should work with a new enhanced data set in which two significant instants are considered: one corresponds to the beginning of the interferon therapy, while the other is set several months after that. This setting makes it possible to study changes in time patterns due to the therapy. Moreover, measurements from the table hemat will be included. Results from this new data are under investigation now.

The project showed that cooperation of the two tools, SumatraTT and LISp-Miner, is effective and allows a fast cycle of data preprocessing and data mining. The whole process can be easily modified and reused for different data mining tasks (e.g. the influence of interferon) and even for different datasets.

Acknowledgements

The work described here has been supported by the grant 201/05/0325 of the Czech Science Foundation and the research program No. MSM 6840770012 Transdisciplinary Research in Biomedical Engineering II of the CTU in Prague.

References

1. Hájek, P., Havránek, T.: Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory). Springer-Verlag, 1978.
2. Ho, T. B. et al.: Combining temporal abstraction and data mining to study hepatitis. In: Proceedings of the Discovery Challenge 2004. A Collaborative Effort in Knowledge Discovery from Databases. Prague: University of Economics, 2004.
3. Kléma, J., Nováková, L., Karel, F., Štěpánková, O.: Trend Analysis in Stulong Data. In: Proceedings of the Discovery Challenge 2004. A Collaborative Effort in Knowledge Discovery from Databases. Prague: University of Economics, 2004, pp. 56-67.
4. Rauch, J., Šimůnek, M. (2000): Mining for 4ft Association Rules. In: Arikawa, S., Morishita (eds.): Discovery Science. Springer-Verlag, pp. 268-272.
5. Rauch, J., Šimůnek, M. (2005): An Alternative Approach to Mining Association Rules. In: Lin, T. Y., Ohsuga, S., Liau, C. J., and Tsumoto, S. (eds.): Data Mining: Foundations, Methods, and Applications. Springer-Verlag, 2005, pp. 219-238 (to appear).
6. Rauch, J., Šimůnek, M. (2005): GUHA Method and Granular Computing. In: Hu, Xiaohua, Liu, Qing, Skowron, Andrzej, Lin, Tsau Young, Yager, Ronald R., Zang, Bo (eds.): Proceedings of IEEE International Conference on Granular Computing. IEEE, 2005, pp. 630-635.
7. Šimůnek, M. (2003): Academic KDD Project LISp-Miner. In: Abraham, A. et al. (eds.): Advances in Soft Computing. Intelligent Systems Design and Applications. Springer, Berlin Heidelberg New York.
8. Štěpánková, O., Aubrecht, P., Kouba, Z., Mikšovský, P.: Preprocessing for Data Mining and Decision Support, pp. 107-117. Kluwer Academic Publishers, Dordrecht, 2003.