Mathematical comparison of Gene Set Z-score with similar analysis functions

Similar documents
y response variable x 1, x 2,, x k -- a set of explanatory variables

Supplementary materials Quantitative assessment of ribosome drop-off in E. coli

Z score indicates how far a raw score deviates from the sample mean in SD units. score Mean % Lower Bound

Passing-Bablok Regression for Method Comparison

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation requires to define performance measures to be optimized

Frequency Estimation of Rare Events by Adaptive Thresholding

Chapter 1 - Lecture 3 Measures of Location

Review of Statistics 101

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

H 2 : otherwise. that is simply the proportion of the sample points below level x. For any fixed point x the law of large numbers gives that

2 Descriptive Statistics

FREQUENCY DISTRIBUTIONS AND PERCENTILES

Machine Learning Linear Classification. Prof. Matteo Matteucci

3rd Quartile. 1st Quartile) Minimum

Part 7: Glossary Overview

Math 475. Jimin Ding. August 29, Department of Mathematics Washington University in St. Louis jmding/math475/index.

Bivariate Paired Numerical Data

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

1 A Review of Correlation and Regression

Distribution Fitting (Censored Data)

Robot Image Credit: Viktoriya Sukhanova 123RF.com. Dimensionality Reduction

Validation Report. Calculator for Extrapolation of Net Weight in Conjunction with a Hypergeometric Sampling Plan

Data Mining Stat 588

Solutions exercises of Chapter 7

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

Bio 183 Statistics in Research. B. Cleaning up your data: getting rid of problems

Active learning in sequence labeling

(quantitative or categorical variables) Numerical descriptions of center, variability, position (quantitative variables)

A Quadratic Curve Equating Method to Equate the First Three Moments in Equipercentile Equating

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Empirical Evaluation (Ch 5)

Lecture on Null Hypothesis Testing & Temporal Correlation

, 0 x < 2. a. Find the probability that the text is checked out for more than half an hour but less than an hour. = (1/2)2

Distribution-Free Procedures (Devore Chapter Fifteen)

MATH 1150 Chapter 2 Notation and Terminology

Stat 101 Exam 1 Important Formulas and Concepts 1

Temporal context calibrates interval timing

HOMEWORK #4: LOGISTIC REGRESSION

3.4 Linear Least-Squares Filter

Pointwise Exact Bootstrap Distributions of Cost Curves

CSE 312 Final Review: Section AA

Helsinki University of Technology Publications in Computer and Information Science December 2007 AN APPROXIMATION RATIO FOR BICLUSTERING

Student Activity: Finding Factors and Prime Factors

CHAPTER 1. Introduction

Lecture 06. DSUR CH 05 Exploring Assumptions of parametric statistics Hypothesis Testing Power

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

Package IGG. R topics documented: April 9, 2018

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 78 le-tex

Stochastic Models of Manufacturing Systems

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Descriptive Data Summarization

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

Supporting Information. Controlling the Airwaves: Incumbency Advantage and. Community Radio in Brazil

Descriptive Statistics-I. Dr Mahmoud Alhussami

Supporting Australian Mathematics Project. A guide for teachers Years 11 and 12. Probability and statistics: Module 25. Inference for means

Advanced Statistics II: Non Parametric Tests

Neural networks (not in book)

The bootstrap. Patrick Breheny. December 6. The empirical distribution function The bootstrap

Variance reduction. Timo Tiihonen

Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing

Summary statistics. G.S. Questa, L. Trapani. MSc Induction - Summary statistics 1

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

ON THE NP-COMPLETENESS OF SOME GRAPH CLUSTER MEASURES

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004

Exploratory Data Analysis. CERENA Instituto Superior Técnico

Statistical Reports in the Magruder Program

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Lecture 1: Descriptive Statistics

Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) with applications to Microarrays

Deviations from the Mean

Fundamentals of Similarity Search

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

MS&E 226: Small Data

Design and Implementation of CUSUM Exceedance Control Charts for Unknown Location

Chapter 3 Examining Data

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

Hotelling s One- Sample T2

Boosting: Algorithms and Applications

HOMEWORK #4: LOGISTIC REGRESSION

The Binomial distribution. Probability theory 2. Example. The Binomial distribution

Probabilities and Statistics Probabilities and Statistics Probabilities and Statistics

Random variables (discrete)

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Describing Distributions

Confidence Intervals 1

Lectures on Elementary Probability. William G. Faris

Math 510 midterm 3 answers

Package aspi. R topics documented: September 20, 2016

Smoothed Prediction of the Onset of Tree Stem Radius Increase Based on Temperature Patterns

GEOMETRIC -discrete A discrete random variable R counts number of times needed before an event occurs

Algorithm Independent Topics Lecture 6

Time Series Classification

One-Sample Numerical Data

T-Test QUESTION T-TEST GROUPS = sex(1 2) /MISSING = ANALYSIS /VARIABLES = quiz1 quiz2 quiz3 quiz4 quiz5 final total /CRITERIA = CI(.95).

Application of Variance Homogeneity Tests Under Violation of Normality Assumption

Transcription:

Mathematical comparison of Gene Set Z-score with similar analysis functions Petri Törönen 1, Pauli Ojala 2, Pekka Marttinen 3, Liisa Holm 14 1 The Holm Group, Biocenter II, Institute of Biotechnology, PO Box 56, 00014 University of Helsinki, Finland 2 Finnish Red Cross Blood Service Research and Development, Kivihaantie 7, 00310 Helsinki, Finland 3 Department of Mathematics and Statistics, P.O. Box 68, 00014 University of Helsinki, Finland 4 Department of Biological and Environmental Sciences, P.O. Box 56, 00014 University of Helsinki, Finland Email: Petri Toronen - firstname.lastname@helsinki.fi; Pauli Ojala - firstname.lastname@bts.redcross.fi; Pekka Marttinen - firstname.lastname@helsinki.fi; Liisa Holm - firstname.lastname@helsinki.fi; Corresponding author Short note This is the second supplementary text for the manuscript Novel Robust Universal Scoring Function for Threshold Free Gene Set Analysis. Text represents detailed comparison of the proposed method with other similar methods. This highlights the similarities with more standard methods and recent publications and shows that GSZ score represents a theoretical unification that encompasses these compared methods. Also a graphical representation of the similarities is represented. GSZ-score calculated using the whole list is similar to the correlation of GO class and gene expression data Our main text presents the Gene Set Z-score (GSZ-score) for analysis of subsets of genelists for the GO class related signals. It starts with a score: Diff M X i X i, (1) i S M i pos i S M i neg M X pos M X neg (2) a difference between the sums of scores for members (positive gene group pos) and non-members (neg) of a gene class, among the M genes with the highest differential expression test scores. For simplicity, we omit later the index M from Diff M. The equation 1 is more stable as a Z-score: 1

Figure 1: Visualization of the similarities between GSZ-score, Max-Mean and Random Sets. Figure represents the GSZ-score profile for ALL dataset (red line) for GO class immunological synapse. Profile is calculated over the ordered gene list. Figure also shows seven percentiles (blue lines), obtained from 500 column randomizations of the same class and dataset. Percentiles are: 0, 5, 25, 50, 75, 95, and 100. Notice the clear separation of the positive curve with the randomization results. In addition figure shows an approximate position that corresponds with Max-Mean scoring. This position is not necessarily exactly same across randomizations. Furthermore, Max-Mean lacks the stabilization performed with GSZ-score. Figure also shows the last point, where GSZ-score is identical with Random Sets scoring function and similar with correlation. This class shows simultaneous up and down-regulation resulting in a small average expression signal, seen at the last point. 2

where E(Diff) 2E(X)E(N) ME(X) and Diff E(Diff) Z D2 (Diff) + k, (3) D 2 (Diff) 4( D2 (X) M 1 (E(N)(M E(N)) D2 (N)) + E(X) 2 D 2 (N)) (4) where N is the number of positive genes in the subset, M is the size of the subset, E(N) is the mean and D 2 (N) is the variance of the hypergeometric distribution of the number of positive genes N for the analyzed subset. D 2 (X) and E(X) are the variance and the mean of the primary test scores for the subset. k represents the prior variance used to stabilize the function (see main text). The derivation of E(Diff) and D 2 (Diff) is shown in the main text. As a difference to earlier methods (Kolmogorov-Smirnov, modified Kolmogorov-Smirnov, iga), the signal can be calculated using the whole gene list (M L). In that case the variance and mean of the resulting hypergeometric distribution are D 2 (N) 0 and E(N) K, and the variance estimate of Diff is reduced to: D 2 (Diff) 4( D2 (X i ) M 1 (E(N)(M E(N)) D2 (N)) + E(X) 2 D 2 (N)) 4D2 (X i )K(L K) L 1 (5) The expectation of the Diff is similarly reduced to E(Diff) 2E(X)E(N) ME(X) (2K L)E(X) (6) Equation (5) is actually identical to the D 2 (S N) derived in the supplementary text 1 [see additional file 1] that is used in the derivation of the variance equation. A more detailed view on this equation shows that it can be considered as a function of two components: D 2 (X) and K(L K)/(L 1). The first one represents the variance of the test scores and the latter reminds variance of binary classification to class members (K) and non-members (L K). Similarly the equation (6) is reduced to the product of the expectation of a binary classification (2K L) and expectation of the differential expression values (E(X)). In overall, the whole equation becomes: 3

Z Xpos X neg E(Diff) D2 (Diff) Xpos X neg (2K L)E(X) 2 (D 2 (X i )K(L K))/(L 1) Xpos (L E(X) X pos ) (2K L)E(X) 2 (D 2 (X i )K(L K))/(L 1) 2 X pos 2KE(X) 2 (D 2 (X i )K(L K))/(L 1) (7) We used the rule L E(X) X pos + X neg, where X pos is a sum of expression values for positive genes (class members) taken from all data. This shows that Z-score is in this case very similar to the correlation between the class-member classification (binary variable Y ) and the given differential expression test scores (X). For comparison the correlation is: Corr E(XY ) E(X)E(Y ) D2 (X)D 2 (Y ) Xpos E(X)K D2 (X)L(K/L)(L K)/L Xpos E(X)K D2 (X)K(L K)/L (8) (9) (10) Here we have used assumptions E(Y ) K and D 2 (Y ) np(1 p) L(K/L)((L K)/L). Note that we omit from this comparison the contribution of the prior variance k. This is done to simplify the comparison of equations. Also, our detailed analysis shows that the prior variance is mostly clearly smaller than the estimated variance (data not shown), so it should only contribute a small stabilizing effect for the smallest variances. Notice that this case corresponds also to the scoring function represented by Newton et al. We discuss the differences between these methods later. Gene Set Z-score can be identical to Hypergeometric Z-score Another extreme case would occur, if we keep the order of the genes in the list but omit the test scores from our analysis and give to each gene a weight equal to unity. In this case D 2 (X) 0, E(X) 1, and the variance estimate reduces to D 2 (S) E(X) 2 D 2 (N) D 2 (N). So only the variance for the hypergeometric distribution remains. The expectation value for the sum changes similarly: 4

E(Diff) 2E(X)E(N) ME(X) 2E(N) M. The resulting Z-score is: Z raw Xpos X neg E(Diff) D2 (Diff) Xpos X neg (2E(N) M) 2 (D 2 (N) Xpos (M X pos ) 2E(N) + M 2N 2KE(N) 2 (D 2 (N) 2 (D 2 (N) (11) This is simply the hypergeometric Z-score for a variable N. We have again omitted the prior variance for simplicity. Similar phenomenon occurs (in theory) when data subset includes only a single data point (M 1, details not shown). Still a drawback in M 1 case is that the equation is not defined due the term M 1 in divider (generating 0/0). Due to this, we omit this case from analysis. Note that the hypergeometric distribution is currently probably the most popular method for analyzing gene lists using a threshold. The reasonable results obtained with iga (hypergeometric p-value based analysis) propose that a simple analysis of order would also be useful. This could be obtained by ordering the gene list with gene expression values, and then giving to each gene the expression value unity. Max-Mean statistic reminds Gene Set Z-score The score proposed by Efron & Tibshiriani was a max-mean statistic, where the analyzed gene set (positive genes) is divided into a half at X 0. This corresponds to our Gene Set Z-score analysis with the pre-positioning of the threshold at the exactly same location. Their score is based on sum of the expression values for both halves (half with positive regulation, half with negative regulation), next it divides each half with the whole size of the gene set. The bigger of the absolute values for two halves is selected as the score. MaxMean max(abs( X i ), abs( X i>0 X i<0 i pos i pos X i ))/N pos (12) This reminds our Z-score analysis as our analysis of differences can be considered as a function of sum of scores for positive genes. The differences are: (a) Max-mean scoring omits totally the class non-members 5

and potentially the large expression signals among them. This was corrected by running several row randomizations of expression data. We propose the usage of direct estimates of mean and STD for the selected subset. Our estimates take directly into account the deviation in the gene expression data, among both the members and non-members. Our estimates could be also directly used to stabilize the max-mean function. (b) With max-mean the subset with larger absolute sum is taken as the score. This does not take into the consideration the potentially larger overall deviation from the null in the other tail, or even the potential bias where the whole expression distribution would be shifted up or down. Thus, in such cases max-mean would be potentially unable to report the signals, occuring in the tail area with weaker overall signal. This could be again corrected by using our mean and STD estimates for the selected subsets. (c) Max-mean is used with a pre-fixed threshold position, whereas we propose the optimization of the threshold position so that the maximum absolute signal is obtained. While both of these methods have pros and cons, it might be still beneficial for the max-mean type scoring to allow the threshold to move at least within some frame. This way the predefined threshold would not affect the results so much. As an example, the analyzed datasets outside the actual differential gene expression analysis can include skewed distributions including only positive values. The used null threshold is clearly unusable with such datasets, and the selection of new useful threshold might be challenging in those cases. Note still that the work of Efron and Tibshiriani propose that one could use also GSZ-score with some pre-defined threshold positions. Score function by Newton et al. is GSZ-score calculated from the whole gene list While finishing this manuscript we were pointed to the publication by Newton et al. They represent similar method as ours. Their scoring starts with X N X pos/n, a mean value of differential expression for all the genes that belong to the analyzed class. Next, the mean and variance estimates for X are represented. These are practically the same equations as our equations for E(S N) and D 2 (S N), with only differences caused by division with N. So their scoring function corresponds GSZ score, when it is calculated using the whole gene list (equation 7). The major difference between these methods is that GSZ-score can monitor simultaneous up and down-regulation, as it goes through the gene list. 6