Additional File 1 - Detailed explanation of the expression level CPD


As mentioned in the main text, the main CPD for the biclustering model consists of two individual factors:

$$P(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) = P_1(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) \cdot P_2(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) \qquad (1.1)$$

1.1. CPD factor 1: $P_1(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID})$

This factor describes the main conditional probability that an expression level belongs to a distribution, determined by the gene to bicluster assignment gene.B, the array to bicluster assignment array.B and a specific array ID array.ID. Below we first define the CPD $P_1(\,)$ for three specific situations: one in which the expression value is assigned to the background, one in which it is assigned to a single bicluster, and one in which it is assigned to different overlapping biclusters. Based on these specific definitions (indicated by a *) we introduce the generalized definition of $P_1(\,)$ that covers all three situations.

Situation 1: background distribution

If the expression level is not part of any bicluster (gene.B ∩ array.B = ø), it is assigned to a virtual bicluster with index -1 that describes the background. This bicluster is described with separate Normal distributions $(\mu_{bgr,a}, \sigma_{bgr,a})$, one for each array $a$. The parameters of these distributions are fixed and derived a priori from the dataset using a robust estimation.
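The per-array background parameters could be obtained along the following lines. This is only a minimal sketch: the text does not say which robust estimator was used, so the median and the scaled median absolute deviation (MAD) are an assumption made here for illustration, and the function name `background_parameters` is hypothetical.

```python
from statistics import median

def background_parameters(columns):
    """Return one (mu, sigma) pair per array.

    `columns` holds one list of expression levels per array.
    Assumption: median and scaled MAD stand in for the unspecified
    robust estimation mentioned in the text.
    """
    params = []
    for values in columns:
        mu = median(values)
        mad = median([abs(v - mu) for v in values])
        # 1.4826 scales the MAD so that it estimates sigma for Normal data
        sigma = 1.4826 * mad
        params.append((mu, sigma))
    return params
```

With this choice a single extreme value in an array, such as one strongly induced gene, barely shifts the background estimate, which is the property a robust estimator is meant to provide.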

Situation 2: biclusters without overlap

If no overlap occurs between different biclusters, each expression level can only be assigned to exactly one bicluster, each of which is modeled with a Normal distribution with parameters $(\mu_{a,b}, \sigma_{a,b})$. The values of these parameters depend on the gene to bicluster and array to bicluster assignments (gene.B, array.B) and on the unique array identifier array.ID. The probability $P_1^*(\,)$ to observe an expression level that belongs to a single bicluster only is defined as:

$$P_1^*(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) = P(\mathit{level} \mid \mathit{array.ID} = a, \mathit{bicluster} = \{b\}) = P(\mathit{level} \mid a, b) = \frac{1}{\sqrt{2\pi}\,\sigma_{a,b}} \exp\left(-\frac{(\mathit{level} - \mu_{a,b})^2}{2\sigma_{a,b}^2}\right) \qquad (1.2)$$

We introduced the probability $P(\mathit{level} \mid \mathit{array.ID} = a, \mathit{bicluster} = \{b\})$ as the probability that an expression level belongs to a single bicluster. The attribute bicluster does not formally exist in the model, but it is implicitly defined as the set of bicluster indices to which the expression level belongs, namely the intersection gene.B ∩ array.B.

Situation 3: overlapping biclusters

When different biclusters overlap, an expression level can belong to multiple biclusters. To avoid overfitting it seems appropriate to model the overlap regions using the parameter sets that were already defined for the individual biclusters (situation 2, i.e., one parameter set per array-bicluster combination). For example, depending on the definition of the overlap, $P_1^*(\,)$ would be assigned a high probability if the expression level approximates the sum, average, weighted sum, minimum, or maximum, etc. of the probability distributions in the contributing biclusters. In our model we choose an overlap model in which the probability of an expression level in the overlap region is defined as the geometric mean of the probabilities assigned to the expression level based on the distributions of the individual biclusters. For computational reasons, we assumed that the standard deviations of the distributions of the overlapping biclusters are almost identical and that an expression level can belong to at most two biclusters. Formally, $P_1^*(\,)$ can then be defined as:

$$P_1^*(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) = \prod_{b \in \{set(\mathit{gene.B}) \cap set(\mathit{array.B})\}} P(\mathit{level} \mid a, b)^{1/\#\{set(\mathit{gene.B}) \cap set(\mathit{array.B})\}} = \prod_{b \in set(B_e)} P(\mathit{level} \mid a, b)^{1/\#set(B_e)} \qquad (1.3)$$

where the following notation is used: $set(X)$ denotes the set of indices $i$ for which the vector elements $X_i$ of the binary vector $X$ are 1. $B_e$ is defined as the element-wise (dot) product of gene.B and array.B. Therefore, $set(B_e)$ is the set of bicluster indices in the intersection of gene.B and array.B, or formally: $set(B_e) = set(\mathit{gene.B}) \cap set(\mathit{array.B})$. Finally, $\#set(B_e)$ is the number of elements in this set.

Generalized formula

The following notation covers all situations mentioned above:

$$P_1(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) = \prod_{b \in set(B_e)} P(\mathit{level} \mid a, b)^{1/\#set(B_e)} \qquad (1.4)$$
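The generalized formula above can be sketched in a few lines of Python. The function names, the dictionary layout of `params` (bicluster index to `(mu, sigma)` for one fixed array) and the use of log-densities are assumptions made for illustration; only the geometric mean over the intersection of the two assignments and the fallback to the background index -1 come from the text.

```python
import math

BACKGROUND = -1  # index of the virtual background bicluster

def normal_logpdf(x, mu, sigma):
    """Log-density of a Normal(mu, sigma) distribution at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))

def p1(level, gene_b, array_b, params):
    """Geometric mean of the Normal densities of all biclusters that
    contain both the gene and the array.

    gene_b, array_b: binary lists, one entry per bicluster.
    params: hypothetical dict {bicluster index: (mu, sigma)} for this
    array; it must also hold the background entry under key -1.
    """
    # set(B_e): indices where both assignment vectors are 1
    members = [b for b, (g, a) in enumerate(zip(gene_b, array_b)) if g and a]
    if not members:
        # empty intersection: the level belongs to the background
        members = [BACKGROUND]
    log_sum = sum(normal_logpdf(level, *params[b]) for b in members)
    # exp(mean of log-densities) equals the geometric mean of the densities
    return math.exp(log_sum / len(members))
```

Working with log-densities avoids numerical underflow when a level lies far from a bicluster mean. With a single member the geometric mean reduces to the plain Normal density of situation 2.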

Situation 2 is implicitly covered in the notation of situation 3, as it can be formulated as a special case of overlap with only one bicluster. Situation 1 is covered by the use of the virtual bicluster with index -1. This background bicluster can by definition not overlap with any other bicluster. The definition of the set $set(B_e)$ is also slightly different from how it was defined in situation 2, as $\prod_{b \in set(B_e)}$ now covers:

$B_e$ is empty: background distribution, the product is over the set $b \in [-1]$, so $set(B_e) = [-1]$.

$B_e$ is not empty: bicluster distribution, the product is over the set of biclusters in the intersection and never includes $b = -1$ by definition.

1.2. CPD factor 2: $P_2(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID})$

Without penalizing for model complexity, the MAP solution would include a very large number of biclusters, since each additional bicluster introduces additional degrees of freedom to model the expression values. Models with many biclusters can better explain the data and thus result in higher MAP solutions. Reducing model complexity in a traditional way by including additional terms in the log-likelihood or log-posterior distribution (such as the Bayesian information criterion (BIC) [1] or the Akaike information criterion (AIC) [2]) would lead to computational intractability if an Expectation-Maximization algorithm is used to find the MAP solution. The optimization algorithm assumes independent optimization per gene or per array in the substeps of the EM procedure. This independence no longer exists if one of the criteria mentioned above is included in the model. Therefore, an alternative strategy is used to reduce model complexity by introducing a penalty factor $P_2(\,)$.

The additional penalty factor $P_2(\,)$ is defined such that it only allows a set of expression levels to be included in a bicluster if they are on average N times more likely to be in their respective bicluster distribution than in their background distribution. The factor $P_2(\,)$ decomposes similarly to $P_1(\,)$, leading to the following expression:

$$P_2(\mathit{level} \mid \mathit{gene.B}, \mathit{array.B}, \mathit{array.ID}) = \prod_{b \in set(B_e)} P_2(\mathit{level} \mid b)^{1/\#set(B_e)} \qquad (1.5)$$

where $P_2(\mathit{level} \mid b) = \pi_{bgr}$ describes the probability that the expression level belongs to the background bicluster ($b = -1$) and $P_2(\mathit{level} \mid b) = \pi_{bicl}$ describes the probability that the expression level belongs to a bicluster other than the background ($b \neq -1$). This implies that a subset of expression levels $E$ for a particular gene or array will be assigned to a bicluster if Equation (1.6) holds:

$$\prod_{e \in E} P_1(e, bicl)\,\pi_{bicl} > \prod_{e \in E} P_1(e, bgr)\,\pi_{bgr} \iff \frac{\prod_{e \in E} P_1(e, bicl)}{\prod_{e \in E} P_1(e, bgr)} > \prod_{e \in E} \frac{\pi_{bgr}}{\pi_{bicl}} \qquad (1.6)$$

The user-defined ratio $\pi_{bgr}/\pi_{bicl}$ indicates how many times more likely it must be on average that an expression value is part of the bicluster distribution compared to being part of the background distribution before such a set of expression values $E$ is actually added to the bicluster.

To guide the user in determining this ratio, we assume there exist one or more sets of genes in the dataset that are known to be coexpressed. In most practical biological situations, such known sets of genes exist (e.g., a set of operon genes). If such a set is not available, standard clustering techniques can also be used to identify one or more of these clusters. Figure 1.1 illustrates how to choose the ratio $\pi_{bgr}/\pi_{bicl}$ given that a set of genes is known to be coexpressed. We calculate for every array the probability score that it is generated by the bicluster distribution to which it is assigned versus its score of being generated by the background distribution. The difference between these two probability scores is defined as δ. If the conditions under which these genes are coexpressed are also known in advance (see Figure 1.1 (top panel)), we use the known labels that indicate whether or not the array belongs to the bicluster to train a classifier. This implies determining the optimal threshold of δ so that the global error rate of misclassifying an array with known label is minimized (= the product of the false positive rate and the false negative rate). If the conditions are unknown in advance, a plot of sorted δs is made. The suggested δ is the one that makes the best distinction between arrays with a low δ and arrays with a high δ value (cut-off point), as shown in Figure 1.1 (bottom panel).
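The inclusion rule and the choice of a δ threshold from labelled arrays can be illustrated with a short Python sketch. Both function names are hypothetical. The first function restates the inclusion condition in log form, assuming δ is a difference of log-probability scores and that the same logarithm base is used throughout. The second is a simplification: it minimizes the plain number of misclassified arrays, which is an assumption and not necessarily the error criterion used by the authors.

```python
def include_in_bicluster(logp_bicl, logp_bgr, log_ratio):
    """Inclusion condition in log form for a set E of expression levels.

    logp_bicl, logp_bgr: log P_1 of each level under the bicluster and
    the background distribution. log_ratio: log(pi_bgr / pi_bicl).
    """
    deltas = [lb - lg for lb, lg in zip(logp_bicl, logp_bgr)]
    # sum of deltas must exceed #E times the log ratio
    return sum(deltas) > len(deltas) * log_ratio

def best_delta_threshold(deltas, labels):
    """Pick the threshold on delta with the fewest misclassified arrays.

    labels[i] is True when array i is known to belong to the bicluster.
    Candidates are midpoints between consecutive distinct deltas.
    Assumption: plain misclassification count as the error measure.
    Returns (threshold, errors), or None with fewer than two distinct deltas.
    """
    ordered = sorted(set(deltas))
    candidates = [(lo + hi) / 2.0 for lo, hi in zip(ordered, ordered[1:])]
    best = None
    for t in candidates:
        errors = sum((d > t) != bool(y) for d, y in zip(deltas, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best
```

The threshold returned by `best_delta_threshold` could then serve as the `log_ratio` argument of `include_in_bicluster`, mirroring how the text derives the ratio from genes known to be coexpressed.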

Figure 1.1. Determining the ratio $\log(\pi_{bgr}/\pi_{bicl})$. (top) Results of a simulated 500x200 dataset with three 50x50 biclusters (noise level 0.2). The plot shows the δs over all arrays (multiplied with the number of biclusters) for a set of genes that are known to be coexpressed in a number of arrays. Large δs indicate that the expression levels of the genes are more likely to be part of the bicluster distribution for that array than to be part of the background distribution. The δ threshold that best classifies these two sets of arrays is 0.5, leading to an optimal ratio of $\log(\pi_{bgr}/\pi_{bicl}) = 0.5$. (bottom) In the E. coli compendium, for the set of genes that are known to be regulated by FNR, the plot shows sorted δs over all arrays. Based on this plot, well chosen values for $\log(\pi_{bgr}/\pi_{bicl})$ range between -0.5 and -1.0.

References

1. Schwarz G: Estimating the Dimension of a Model. Annals of Statistics 1978, 6:461-464.
2. Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19:716-723.