Marginal Balance of Spread Designs

1 Marginal Balance of Spread Designs for High-Dimensional Binary Data
Joe Verducci, Ohio State
Mike Fligner, Ohio State
Paul Blower, Leadscope

2 Motivation
Database: M x N array of 0-1 bits
-- M = number of compounds
-- N = number of structural features
Objective: Select a subset of m compounds for further testing, so that we can learn which combinations of structural features are associated with biological response.

3 Complication
Testing for biological activity typically involves in vitro (or in silico) bioassays.
Efficient designs utilize compounds containing about ½ of all features.
For in vivo testing (e.g. gene expression in mice), compounds should contain close to a target proportion ($p_0 < ½$) of features because large compounds tend to be broken down.

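4 Notation
x = original binary string
z = expanded binary string
$V = x x^T$
v = V[lower.tri(V)]
$z^T = [x^T, v^T]$
$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_m^T \end{bmatrix}, \quad Z = \begin{bmatrix} z_1^T \\ \vdots \\ z_m^T \end{bmatrix}$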
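A minimal sketch of the expansion from x to z, assuming that lower.tri follows R's default (strict lower triangle, read column by column), so that v collects the pairwise products of distinct features. The function name `expand` is illustrative only.

```python
def expand(x):
    """Return z = [x, v], where v holds x[j] * x[k] for all j > k."""
    n = len(x)
    # Strict lower triangle of V = x x^T in column-major order,
    # which is the order R's V[lower.tri(V)] would produce (assumed here).
    v = [x[j] * x[k] for k in range(n) for j in range(k + 1, n)]
    return list(x) + v


x = [1, 0, 1, 1]
z = expand(x)
# z has N + N(N-1)/2 entries; a compound with s on bits in x
# has s + s(s-1)/2 = s(s+1)/2 on bits in z.
```

With N features, z has N main-effect bits followed by N(N-1)/2 two-way interaction bits.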

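5 Modulated Response Model for In Vivo Activity
$Y = \alpha(x)\, z^T \beta + \varepsilon$
$\alpha(x) = \frac{1}{\sqrt{2\pi}\, h} \exp\left\{ -\frac{\left( \sum_j x_j - N p_0 \right)^2}{2 h^2} \right\}$
$\varepsilon \sim N(0,1)$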
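A small sketch of the modulating factor, assuming the Gaussian-shaped form read from this slide: α(x) = exp(-(Σ x_j - N p_0)² / (2h²)) / (√(2π) h). The function names `alpha` and `mean_response` are illustrative, not taken from the talk.

```python
import math


def alpha(x, p0, h):
    """Modulating factor: peaks when the number of on bits equals N * p0."""
    n = len(x)
    s = sum(x)  # number of features present in the compound
    return math.exp(-((s - n * p0) ** 2) / (2 * h * h)) / (math.sqrt(2 * math.pi) * h)


def mean_response(x, z, beta, p0, h):
    """E[Y] = alpha(x) * z^T beta (the noise term has mean zero)."""
    return alpha(x, p0, h) * sum(zi * bi for zi, bi in zip(z, beta))


on_target = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # 2 of 10 bits on, p0 = 0.2
too_large = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # 7 of 10 bits on
```

Compounds far from the target size contribute a damped signal, which is what makes their observations less informative about β.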

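6 Information Matrix
$I(\beta) = \sum_{i=1}^{m} \alpha(x_i)^2\, z_i z_i^T = Z^T A_X^2 Z$
Here $A_X$ is taken to be the diagonal matrix with entries $\alpha(x_1), \ldots, \alpha(x_m)$, which is what the second equality requires.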

7 General Optimality
Size: magnitude of eigenvalues
-- Maximize $\operatorname{tr}[I(\beta)]$
Balance: equality of eigenvalues
Both (for designs of full rank)
-- Minimize $\operatorname{tr}[I(\beta)^{-1}]$
-- Maximize $\det[I(\beta)]$

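8 Total Information
$\operatorname{tr}[I(\beta)] = \sum_{i=1}^{m} \alpha(x_i)^2\, z_i^T z_i \;\propto\; \sum_{i=1}^{m} p_i (N p_i + 1) \exp\left\{ -\frac{(p_i - p_0)^2}{(h/N)^2} \right\}$
where $p_i$ is the proportion of features present in compound i, so that $z_i^T z_i = N p_i (N p_i + 1)/2$.
The total is maximized when each $p_i = p_1 > p_0$.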
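The claim that the optimum sits above the target proportion can be checked numerically. The sketch below assumes the per-compound information is α(x)² z'z, with α the Gaussian-shaped factor of the response model and z'z = s(s+1)/2 for a compound with s features. The values N = 186, p_0 = .12 and h = N/10 are borrowed from other slides in this talk; the function name is illustrative.

```python
import math


def info_per_compound(s, n, p0, h):
    """alpha(x)^2 * z^T z for a compound with s of n features present."""
    alpha_sq = math.exp(-((s - n * p0) ** 2) / (h * h)) / (2 * math.pi * h * h)
    return alpha_sq * s * (s + 1) / 2  # z has s + s(s-1)/2 on bits


n, p0 = 186, 0.12
h = n / 10
# Search every possible feature count for the most informative compound size.
best_s = max(range(n + 1), key=lambda s: info_per_compound(s, n, p0, h))
p1 = best_s / n
```

Because z'z grows with compound size while α² decays away from N p_0, the maximizer p1 lands somewhat above p0.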

9 Information from Each Compound
Target proportion $p_0 = .12$
[Figure: information per compound versus proportion of features; curves labeled h=n/10 and h=n/]

10 Marginal Balance
Two-way margins of features
$p_1$ = optimal proportion of compounds per feature

                 Feature k = 0      Feature k = 1
Feature j = 0    M(1-p_1)^2         M p_1 (1-p_1)
Feature j = 1    M p_1 (1-p_1)      M p_1^2
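As an illustration of the two-way margins, the sketch below tabulates the observed 2 x 2 counts for a pair of features in a design and the counts a perfectly balanced design would show. Both function names are hypothetical, and no particular balance criterion is implied; the talk's own criterion is not reproduced here.

```python
def two_way_margin(design, j, k):
    """Observed 2 x 2 table of counts for features j and k (rows: j, cols: k)."""
    table = [[0, 0], [0, 0]]
    for row in design:
        table[row[j]][row[k]] += 1
    return table


def balanced_margin(m, p1):
    """Target 2 x 2 table for m compounds when each feature has prevalence p1."""
    q = 1 - p1
    return [[m * q * q, m * p1 * q],
            [m * p1 * q, m * p1 * p1]]


design = [[1, 1, 0],
          [1, 0, 0],
          [0, 1, 1],
          [0, 0, 1]]
observed = two_way_margin(design, 0, 1)
target = balanced_margin(len(design), 0.5)
```

Comparing `observed` with `target` over all feature pairs is one way to see how far a design is from marginal balance.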

11 Insufficiency of Marginal Balance Criterion
Need additional criterion to spread 1s among different compounds
Avoid near duplication of compounds

12 Example: 8 x 20 Design
P_1 =
Spread Design contains 76 different pairs of features
Full 20 x 20 Cyclic Design contains 105 different pairs of features

13 Spread Design
Select a subset S of fixed size m so as to maximize the minimum distance between points in S.
Higgs Algorithm:
-- Choose points sequentially: at each step, pick the point that maximizes the minimum distance to the already selected points.
-- Leads to a near-optimal solution.
Choice of distance greatly affects the resulting design.
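The sequential selection described above can be sketched as a greedy maximin loop. This is a minimal illustration under stated assumptions, not the published algorithm's implementation: the function names, the tie-breaking rule (first candidate wins) and the use of Hamming distance in the example are all choices made here.

```python
def hamming(a, b):
    """Number of positions where two bit strings differ."""
    return sum(u != v for u, v in zip(a, b))


def greedy_spread(points, m, dist, seed=0):
    """Pick m point indices, each maximizing the minimum distance to those chosen so far."""
    selected = [seed]
    # min_d[i] = distance from point i to its nearest selected point
    min_d = [dist(p, points[seed]) for p in points]
    while len(selected) < m:
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_d[i])
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_d = [min(d, dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return selected


points = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
design = greedy_spread(points, 3, hamming)
```

Swapping `hamming` for another distance function changes which compounds are selected, which is the point of the comparison that follows.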

14 XOR (Hamming Distance)
XOR (Hamming): only accounts for bits that don't match
For bit strings A and B:
$d^2_{XOR} = \sum_k (A_k \;\mathrm{XOR}\; B_k)$
Larger structures have more bits that don't match each other
Diversity result: tends to favor larger structures with a lot of features
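A short worked example of the size effect, using made-up 10-bit strings: two small compounds with no shared features can differ in at most as many bits as they have on, while two large ones can differ in many more.

```python
def xor_distance_sq(a, b):
    """Squared XOR distance: the count of mismatched bits."""
    return sum(u ^ v for u, v in zip(a, b))


# Hypothetical bit strings: both pairs share no features at all.
small_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
small_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
large_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
large_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

d_small = xor_distance_sq(small_a, small_b)
d_large = xor_distance_sq(large_a, large_b)
```

Both pairs are completely dissimilar in terms of shared features, yet the large pair is 2.5 times as far apart, so a maximin search under this distance drifts toward large compounds.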

15 Tanimoto Coefficient
a = # bits on in A
b = # bits on in B
c = # bits on in both A and B
d = # bits off in both A and B
Tanimoto Coefficient: $T_1 = \frac{c}{a + b - c}$
-- Measures similarity using on bits
Tanimoto Coefficient Complement: $T_0 = \frac{d}{a + b - 2c + d}$
-- Measures similarity using off bits
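The two coefficients can be computed directly from the four counts, using T_1 = c / (a + b - c) and T_0 = d / (a + b - 2c + d). The sketch assumes the two strings are not both all-off (for T_1) or both all-on (for T_0), since the denominators would then be zero; the function names are illustrative.

```python
def bit_counts(A, B):
    """Return (a, b, c, d) for two equal-length 0/1 lists."""
    a = sum(A)
    b = sum(B)
    c = sum(1 for u, v in zip(A, B) if u == 1 and v == 1)
    d = sum(1 for u, v in zip(A, B) if u == 0 and v == 0)
    return a, b, c, d


def tanimoto_on(A, B):
    """T_1: shared on bits over bits on in at least one string."""
    a, b, c, _ = bit_counts(A, B)
    return c / (a + b - c)


def tanimoto_off(A, B):
    """T_0: shared off bits over bits off in at least one string."""
    a, b, c, d = bit_counts(A, B)
    return d / (a + b - 2 * c + d)


A = [1, 1, 0, 0, 0, 0]
B = [1, 0, 1, 0, 0, 0]
```

For these strings a = b = 2, c = 1, d = 3, so T_1 = 1/3 and T_0 = 3/5.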

16 MT = Modified Tanimoto
Measures similarity based on both the presence (on bits) and absence (off bits) of features
$MT = \alpha T_1 + (1 - \alpha) T_0$
where $\alpha = \frac{2 - p}{3}$ and $p = \frac{a + b}{2n}$.
When there are fewer on bits: $T_1$ is weighted more heavily.
When there are fewer off bits: $T_0$ is weighted more heavily.
As a variation, p may be fixed by external considerations. The result is called the P-Modified Tanimoto distance.
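A sketch of the weighted combination, assuming MT = α T_1 + (1 - α) T_0 with α = (2 - p)/3 and p = (a + b)/(2n), where n is the string length. Passing a fixed `p` gives the P-modified variant. The function name and the `p` keyword are illustrative; converting the similarity to a distance (for example 1 - MT) is an assumption and is left to the caller.

```python
def modified_tanimoto(A, B, p=None):
    """Modified Tanimoto similarity; supply p to fix it externally (P-modified)."""
    n = len(A)
    a = sum(A)
    b = sum(B)
    c = sum(1 for u, v in zip(A, B) if u == 1 and v == 1)
    d = sum(1 for u, v in zip(A, B) if u == 0 and v == 0)
    t1 = c / (a + b - c)          # similarity on shared on bits
    t0 = d / (a + b - 2 * c + d)  # similarity on shared off bits
    if p is None:
        p = (a + b) / (2 * n)     # average proportion of on bits in the pair
    alpha = (2 - p) / 3
    return alpha * t1 + (1 - alpha) * t0


A = [1, 1, 0, 0, 0, 0]
B = [1, 0, 1, 0, 0, 0]
mt = modified_tanimoto(A, B)
mt_fixed = modified_tanimoto(A, B, p=0.12)
```

Since p lies between 0 and 1, α ranges from 2/3 (sparse strings) down to 1/3 (dense strings), so neither coefficient is ever ignored.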

17 Implementing Spread Designs
Maximin vs Average Distance
Higgs Algorithm
Stochastic Searches
Near Optimal Solutions

18 Medicinal Drug Database
186 Leadscope Features
-- Prevalence Range:
-- Median:
-- Mean:
Drugs now in market
-- Range: 5-70 distinct features per compound
-- Median: 24 (12.8%) features per compound
-- Mean: 26.4 (14.2%) features per compound

19 Procedure
-- Use Higgs algorithm
-- Apply with 4 different metrics
-- Use each of 1089 compounds as initial seed
-- Pick best (maximin distance) 150 designs for each metric
-- Evaluate balance criterion for all designs
-- Summarize
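The seed-and-rank part of this procedure can be sketched as follows: run the sequential selection once from every compound as the initial seed, score each resulting design by its minimum pairwise distance, and keep the top few. This is a toy illustration with hypothetical names and Hamming distance on five made-up bit strings; the balance evaluation step is not shown.

```python
from itertools import combinations


def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))


def greedy_spread(points, m, dist, seed):
    """Sequential maximin selection starting from the given seed index."""
    selected = [seed]
    min_d = [dist(p, points[seed]) for p in points]
    while len(selected) < m:
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_d[i])
        selected.append(nxt)
        min_d = [min(d, dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return selected


def best_designs(points, m, dist, keep):
    """Try every point as seed; return the `keep` designs with largest minimum distance."""
    results = []
    for seed in range(len(points)):
        design = greedy_spread(points, m, dist, seed)
        score = min(dist(points[i], points[j]) for i, j in combinations(design, 2))
        results.append((score, design))
    # Stable sort: ties keep seed order.
    return sorted(results, key=lambda r: r[0], reverse=True)[:keep]


points = [[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0], [0, 1, 0]]
top = best_designs(points, m=2, dist=hamming, keep=2)
```

In the talk's setting the same loop would run over 1089 seeds for each of the four metrics, keeping 150 designs per metric.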

20 Average Number of Distinct Features of Sampled Compounds
(Population median: 24 features/cmpd)
[Table: rows by sample size; columns by distance: Hamming, Tanimoto, Mod. Tan., P-Mod. Tan. (P = )]

21 Balances of Best Spread Design (of size 20) for Each Distance
[Figure: balance criterion versus P1; curves for Tanimoto, modified Tanimoto, p-modified Tanimoto, and Hamming]

22 Balances with $p_1 = .14$ for Size 20
[Figure: uniform balances for the best 150 samples; groups: ham, tan, mod tan, random]

23 Balances with $p_1 = .20$ for Size 20
[Figure: uniform balances for the best 150 samples; groups: ham, tan, mod tan, random]

24 Balance Results for Medicinals
-- Hamming distance gives the worst balance whenever $p_1 < .25$.
-- Random selection tends to be very erratic.
-- Both Tanimoto and Modified Tanimoto produce good balance for $p_1$ near the database median.
-- Modified Tanimoto produces better balance than Tanimoto for $p_1$ larger than the database median.

25 Conclusion
Using Modified Tanimoto in a spread design tends to produce the best balance over a wide operating range of compound sizes.

26 Recap
Modulated Response Model
-- Includes all two-way interaction terms
-- Peak response for compounds of size $p_0$
-- Response falls off outside a window of width h
Information
-- Total information maximized at $p_1 > p_0$
-- Balance defined in terms of 2-way margins
Spread Designs
-- Higgs Algorithm
-- Metrics
   Hamming -- picks large compounds
   Tanimoto -- picks small compounds
   Modified Tanimoto -- picks medium compounds
Medicinal Database Example
