
Supplementary Information

Mutual Information between Discrete Variables with Many Categories using Recursive Adaptive Partitioning

Junhee Seok 1, Yeong Seon Kang 2,*

1 School of Electrical Engineering, Korea University, Seoul, South Korea.
2 Department of Business Administration, University of Seoul, Seoul, South Korea.

* Corresponding author

Supplementary Methods

Simulation Settings

The proposed calculation of mutual information was evaluated through simulation studies. In each simulation, we considered a relation between two discrete variables for which the true mutual information can be calculated. For two categorical variables X and Y whose relation was predefined, the mutual information was estimated from random samples. Six different relations were used for the simulation studies, with various sample sizes (50, 100, 200, 500, 1,000, and 2,000) and numbers of categories per variable (2, 5, 10, 20, 50, and 100). Each simulation was repeated 100 times. The proposed method with different p-value thresholds was compared with the true mutual information as well as with the results of the conventional calculation. All simulations use variables whose categorical values cannot be ordered. The six simulation settings are listed here.

(1) Step structure with low mutual information: Consider X and Y with two super categories each, i.e., X ∈ {w1, w2} and Y ∈ {v1, v2}, whose relation is given by 2 × 2 joint probabilities: p(w1,v1) = 0.4, p(w1,v2) = 0.1, p(w2,v1) = 0.2, and p(w2,v2) = 0.3. We assume that n/2 fine categories per super category are actually observed instead of the super categories themselves. Let x1, ..., x_{n/2} be observed for w1, and x_{n/2+1}, ..., x_n for w2. Similarly, y1, ..., y_n are assumed to be observed for the super categories of Y. The combinations of the fine categories are assumed to be uniformly distributed within the corresponding combination of the super categories. For example, the (n/2)^2 combinations of {x1, ..., x_{n/2}} × {y1, ..., y_{n/2}} are uniformly distributed within w1 × v1, whose probability is 0.4. For the simulation, the mutual information of X and Y is estimated from randomly generated data with n fine categories per variable and compared with the true mutual information, 0.09 bits.

(2) Step structure with high mutual information: This setting is similar to (1), but the joint probabilities of the super categories are p(w1,v1) = 0.7, p(w1,v2) = 0, p(w2,v1) = 0, and p(w2,v2) = 0.3. Here, the true mutual information is 0.61 bits.

(3) Gaussian structure with low mutual information: The joint population of X and Y is defined by a bivariate Gaussian distribution whose marginal distributions are standard normal and whose covariance (σ) is 0.49. First, continuous random samples are generated from the joint distribution. The observed range of the continuous samples is uniformly discretized into n categories for each variable. Each sample falls into one of the n^2 combinations of discretized X and Y and receives the corresponding categorical values for X and Y. From the data with n categories per variable, the mutual information is estimated. Unlike in the continuous case, the Gaussian structure is hardly observable here because the discretized categories carry no order. When the marginal variances are 1, the theoretical mutual information of a bivariate Gaussian distribution is log(1/(1 − σ^2))/2, which is 0.14 bits in this case.

(4) Gaussian structure with high mutual information: This setting is similar to (3), but the covariance of X and Y is 0.81. The true mutual information is 0.53 bits.

(5) Random structure with low mutual information: A random relation between X and Y is constructed from randomly generated joint probability masses. The probabilities of the n categories of X are generated from an exponential distribution with λ = 1 (and normalized to sum to one); let pX denote the vector of these n probability masses. Similarly, the marginal probability mass vector of Y, pY, is randomly generated from the same exponential distribution. We obtain an n × n joint probability matrix P1 = pX pY^T. P1 gives the joint probability masses of a randomly structured but independent relation. Independently, we obtain P2, another n × n probability matrix, by randomly generating n^2 joint probability masses from an exponential distribution (λ = 1). P2 represents a randomly structured and dependent relation. To ensure both random structure and dependency, (P1 + P2)/2 is used as the final joint probability distribution. Samples are randomly generated from the final joint probabilities, and the mutual information is estimated from them. Although the theoretical mutual information is hard to obtain in this case, it can be estimated empirically with many samples (one million in this work). The true mutual information is expected to vary with the number of categories.

(6) Random structure with high mutual information: This setting is similar to (5), but only P2 is used for the final probabilities. In this case, X and Y are more dependent on each other than in (5). Consequently, this setting simulates a random structure with higher mutual information than (5).
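As a concrete illustration (not the authors' code; all names below are our own), the following sketch generates data under setting (1) and computes the conventional plug-in estimate of mutual information from the empirical frequencies. With natural logarithms, the true value for this setting is about 0.09.

```python
# Minimal sketch of simulation setting (1): step structure with low mutual
# information. Hypothetical names; not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

def sample_step_structure(n_fine, n_samples):
    """Draw samples from the 2 x 2 super-category law, then assign each
    sample a uniformly chosen fine category within its super category."""
    p_super = np.array([0.4, 0.1, 0.2, 0.3])  # p(w1,v1), p(w1,v2), p(w2,v1), p(w2,v2)
    cells = rng.choice(4, size=n_samples, p=p_super)  # joint super-category index
    w, v = cells // 2, cells % 2                      # super categories of X and Y
    half = n_fine // 2
    x = w * half + rng.integers(0, half, n_samples)   # fine category of X
    y = v * half + rng.integers(0, half, n_samples)   # fine category of Y
    return x, y

def plugin_mi(x, y):
    """Conventional (plug-in) mutual information from empirical frequencies."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)                       # contingency table
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px * py)[nz])).sum())  # natural log

x, y = sample_step_structure(n_fine=20, n_samples=2000)
print(plugin_mi(x, y))  # true value is ~0.09; the plug-in estimate drifts upward
```

With n_fine = 2 the plug-in estimate is close to the truth; as n_fine grows at a fixed sample size, the estimate inflates, which is exactly the regime these simulations probe.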

Time Complexity

The number of samples in the input data (n) is commonly used as the input size when calculating time complexity. In the proposed algorithm, the number of categories of the input discrete variables (d) can be considered as another input size of interest. Here, we calculate the time complexity with respect to each input size, n and d. Consider the worst case, in which the whole sample space is partitioned into the finest combinations of categories of the two variables. At every partitioning, the marginal populations must be calculated, which is simply a matter of counting samples. Since the number of partitions depends only on the number of categories in the worst case, the overall complexity with respect to the number of samples n is O(n). To derive the time complexity with respect to the number of categories d, assume that both variables can take d categorical values. At every partitioning, the categorical values must be sorted, which costs k log k when a subregion contains k categories. In the worst case, the k categorical values are partitioned into two subregions with 1 and k − 1 categories, respectively. Therefore, the overall complexity with respect to the number of categories d is $O\left(\sum_{k=2}^{d} k \log k\right)$.
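For reference, a crude bound (our addition, not part of the original derivation) shows that this worst-case cost is polynomial in d:

$$\sum_{k=2}^{d} k \log k \;\le\; \log d \sum_{k=2}^{d} k \;=\; O\!\left(d^2 \log d\right).$$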

Supplementary Table S1. Available data sets of discrete variables with many categories.

Data set               Description                                        Variable                 Number of Categories   Sample Size
MIMIC2 [1]             Clinical records in intensive care                 Diagnosis of admission   ~300                   ~5,000
Medicare [2]           Medicare records in the US (1990-1993)             Inpatient claim          ~1,000                 ~32M
STRIDE [3]             Electronic health records of Stanford hospitals    Inpatient claim          ~1,000                 ~2.7M
                       (2005-2010)                                        Medical operation        ~1,000                 ~3.1M
T-cell repertoire [4]  Short sequencing reads of T-cell receptors         V segment                ~80                    ~1M
                       mapped to specific V, D and J segments             D segment                ~30
                                                                          J segment                ~6
PheWAS [5]             Phenome-wide association study                     Disease group            ~1,500                 ~13,000
                                                                          Genotype of n SNPs       3^n

[1] Scott, D. J. et al. Accessing the public MIMIC-II intensive care relational database for clinical research. BMC Med Inform Decis Mak 13, 9 (2013).
[2] Hidalgo, C. A. et al. A dynamic network approach for the study of human phenotypes. PLoS Computational Biology 5(4), e1000353 (2009).
[3] https://med.stanford.edu/clinicalinformatics.html
[4] Georgiou, G. et al. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol 32, 158-168 (2014).
[5] Denny, J. C. et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 26, 1205-1210 (2010).

Supplementary Figure S1. Noise effects on mutual information. Uniformly distributed noise was added to samples generated under simulation setting (4) in the main text. The x-axis is the fraction of noise samples among all samples. Shown are the expected mutual information (black dashed) as well as estimates by the proposed method (red solid) and the conventional method (black solid).
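A minimal sketch of the noise-injection scheme as we read it from the caption (our interpretation and names, not the authors' code): a given fraction of the final data set consists of pairs drawn uniformly over all category combinations, which carry zero mutual information.

```python
# Hypothetical reading of the Figure S1 noise scheme: replace a fraction
# noise_ratio of the samples with independent, uniformly drawn category pairs.
import numpy as np

rng = np.random.default_rng(0)

def mix_in_uniform_noise(x, y, n_categories, noise_ratio):
    n = len(x)
    n_noise = int(round(noise_ratio * n))
    idx = rng.choice(n, size=n_noise, replace=False)  # samples to overwrite
    x, y = x.copy(), y.copy()
    x[idx] = rng.integers(0, n_categories, n_noise)   # uniform, independent of y
    y[idx] = rng.integers(0, n_categories, n_noise)
    return x, y
```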

Supplementary Figure S2. Mutual information of the ICU data with various p-value thresholds. Shown are the mutual information values calculated from 296 × 252 3-digit ICD-9 codes (level-3 categories), as well as from 17 × 16 level-1 and 89 × 76 level-2 categories, by the proposed method with p-value threshold 0.5 (red) and by the conventional method (black). Results of the proposed method with various other p-value thresholds are also shown (red dashed).