Molecular Similarity Searching Using Inference Network
Ammar Abdo, Naomie Salim*
Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia
Molecular Similarity Searching
- Search for chemical compounds with structure or properties similar to a known compound
- A variety of methods is used in these searches:
  - Graph theory; 1D, 2D and 3D shape similarity; docking similarity; electrostatic similarity; and others
  - Machine learning methods, e.g. BKD, SVM, NBC, NN
- The vector space model using 2D fingerprints and the Tanimoto coefficient is one of the most widely used molecular similarity measures
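As a minimal sketch (not from the slides themselves), the Tanimoto coefficient on binary 2D fingerprints can be computed from the sets of "on" bit positions:

```python
def tanimoto_binary(fp_a, fp_b):
    """Tanimoto coefficient for binary fingerprints given as sets of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)                 # bits set in both fingerprints
    union = len(a) + len(b) - inter    # bits set in either fingerprint
    return inter / union if union else 0.0

# Example with two invented fingerprints sharing 2 of 4 distinct bits
print(tanimoto_binary({1, 5, 9}, {1, 9, 12}))  # 0.5
```

The function names and example bit positions are illustrative; real 2D fingerprints would come from a cheminformatics toolkit.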
Rationale for Chemical Similarity
- Similar property principle: structurally similar molecules are likely to have similar properties
- Given a known active molecule, a similarity search can identify further molecules in the database for testing
Probabilistic Models (Alternative Approach)
- Why probabilistic models?
  - Information retrieval deals with uncertain information
  - Query and compound characterizations are incomplete
  - Probability theory seems to be the most natural way to quantify uncertainty
  - Already applied in IR for text documents
Why Bayesian Networks?
- Bayesian networks are the most popular way of doing probabilistic inference in AI
- Clear formalism to combine evidence
- Modularize the world (dependencies)
- Simple
- Bayesian network models for IR:
  - Inference Network (Turtle & Croft, 1991)
  - Belief Network (Ribeiro-Neto & Muntz, 1996)
Bayesian Inference
- Bayes' rule is the heart of Bayesian techniques:
  P(H|E) = P(E|H) P(H) / P(E)
  where H is a hypothesis and E is evidence
- P(H): prior probability
- P(H|E): posterior probability
- P(E|H): probability of E if H is true
- P(E): a normalizing constant, so we can write P(H|E) ∝ P(E|H) P(H)
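A minimal numeric illustration of Bayes' rule (all probabilities are invented for illustration, framed as a hypothetical screening question):

```python
# H = "compound is active", E = "compound matches the query fingerprint".
p_h = 0.01              # prior P(H): assume 1% of the library is active
p_e_given_h = 0.8       # P(E|H): assumed match rate among actives
p_e_given_not_h = 0.05  # P(E|not H): assumed false-match rate

# Normalizing constant P(E) via the law of total probability
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior P(H|E) = P(E|H) P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))  # 0.139
```

Even with a strong match, the posterior stays modest because the prior is small, which is exactly the behavior the normalizing constant encodes.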
Bayesian Networks
- What is a Bayesian network?
- It is a directed acyclic graph (DAG) in which nodes represent random variables
- The parents of a node are those judged to be its direct causes
- The roots of the network are the nodes without parents
- The arcs represent causal relationships between these variables, and the strengths of these causal influences are expressed by conditional probabilities
- x_1 ... x_n: parent nodes; X is the set of parents of y (in this case, root nodes)
- y: child node; each x_i causes y
- The influence of X on y can be quantified by any function of conditional probabilities: F(y, X) = P(y|X)
Bayesian Networks (continued)
- Example: roots a and b with priors p(a) and p(b); child c with conditional probability p(c|a,b) specified for all values of a, b, c (conditional dependence)
- Running Bayesian networks: given probability distributions for the roots and conditional probabilities of the other nodes, we can compute the a priori probability of any instance
- Changes in parents (e.g., b was observed) will cause recomputation of probabilities
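The a-b-c example above can be sketched in code; the probability tables are invented, only the structure (two roots feeding one child) follows the slide:

```python
# Tiny DAG: a -> c <- b, with invented probability tables.
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
p_c_given = {  # P(c=True | a, b) for all parent states
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(a, b, c):
    """A priori probability of one full instantiation (a, b, c)."""
    pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
    return p_a[a] * p_b[b] * pc

# The joint probabilities over all 8 states sum to 1
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(round(total, 10))  # 1.0
```

This is the "compute the a priori probability of any instance" step; observing a variable (e.g. b) would mean conditioning on it and renormalizing.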
Bayesian Network Approach to Molecular Similarity Searching
- How to describe and compare molecules?
- Network model generation: description of the system in a suitable network form
- Representation of the importance of descriptors (weighting schemes)
- Probability estimation for the network model
- Calculation of the similarity scores
Bayesian Inference Network
- Nodes: compounds (c_j), features (f_i), queries (q_1, q_2, ..., q_r), and the target (A)
- Edges from c_j to its feature nodes f_i indicate that the observation of c_j increases the belief in the variables f_i
Definitions
- f_i, c_j, and q_k are random variables
- F = (f_1, f_2, ..., f_n) is an n-dimensional vector (n equal to the fingerprint length)
- f_i ∈ {0, 1}, so F has 2^n possible states; c_j ∈ {0, 1}; q ∈ {0, 1}
- The rank of a compound c_j is computed as P(q = true | c_j = true)
- (c_j stands for the state where c_j = true and c_i = false for all i ≠ j, because we observe one compound at a time)
Construct Compound Network (once)
- A directed acyclic graph (DAG) with:
  - compound nodes as roots, containing the prior probability of observing each compound
  - feature nodes as leaves, containing the probability associated with each node given its set of parent compounds
Construct Query Network (for each query)
- An inverted DAG with a single leaf for the target molecule and multiple roots corresponding to the features that express the query
- A set of intermediate query nodes may also be used when multiple queries express the target
- Attach it to the compound network
Similarity Calculation
- Find the probability that the target molecule (A) is satisfied given that compound c_j has been observed
- Instantiate each c_j, which corresponds to attaching evidence to the network by stating that c_j is true and the rest of the compounds are false
- Find the subset of c_j's that maximizes the probability value of node A (best subset)
- Retrieve these c_j's as the answer to the query
Bayesian Inference Network
- The retrieval of an active compound, compared to a given target structure, is obtained by means of an inference process through a network of dependences
- To achieve the inference process:
  - We need to estimate the strengths of the relationships represented by the network
  - This involves estimating and encoding a set of conditional probability distributions
- The inference network described here comprises four layers of nodes (four different random variables)
- The first layer comprises the compound nodes (roots); the prior probability associated with these nodes is defined as P(c_j) = 1/(collection size)
Bayesian Inference Network (continued)
- The second layer comprises the feature nodes, so we need to compute P(f_i); P(f_i|c_j) is computed from the first layer, since the dependency is on the parent (compound) nodes
- A weighting function is used to estimate the probability p(f_i|c_j), where:
  - α is a constant; experiments using the inference network (Turtle, 1991) show that the best value for α is 0.4
  - ff_ij is the frequency of the i-th feature within the j-th compound
  - icf_i is the inverse compound frequency of the i-th feature in the collection
  - cl_j is the size of the j-th compound
  - total_cl is the total length of the compounds in the collection
  - m is the total number of compounds in the collection
- (This equation has been adapted from the Okapi retrieval system (Robertson et al., 1995))
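The slide's exact weighting formula is not reproduced here, so the following is only a hedged sketch of a plausible Okapi-style belief estimate built from the stated ingredients (ff_ij, an inverse compound frequency, compound length normalization, and α = 0.4); the precise functional form is an assumption:

```python
import math

def belief(ff_ij, cf_i, cl_j, total_cl, m, alpha=0.4):
    """Sketch of an Okapi-style estimate of P(f_i | c_j).

    Assumed inputs (cf_i, the count of compounds containing feature i,
    is introduced here to derive an inverse compound frequency):
      ff_ij    - frequency of feature i in compound j
      cl_j     - size of compound j
      total_cl - total length of all compounds in the collection
      m        - number of compounds; alpha = 0.4 per Turtle (1991)
    """
    avg_cl = total_cl / m
    # Length-normalized feature frequency (BM25-like shape, assumed)
    ntf = ff_ij / (ff_ij + 0.5 + 1.5 * cl_j / avg_cl)
    # Normalized inverse compound frequency (assumed form)
    nicf = math.log((m + 0.5) / cf_i) / math.log(m + 1)
    return alpha + (1 - alpha) * ntf * nicf

# Illustrative call with invented numbers
print(round(belief(ff_ij=3, cf_i=120, cl_j=40, total_cl=40000, m=1000), 4))
```

Note that the α term gives every parent a baseline belief of 0.4 even when the feature is absent, which matches how such inference-network weights are usually floored.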
Bayesian Inference Network (continued)
- The third layer comprises only the query nodes p(q_k), where:
  - c_jk is the set of features in common between the j-th compound and the k-th query
  - cl_j is the size of the j-th compound
  - ff_ij is the frequency of the i-th feature within the j-th compound
  - nff_ik is the normalized frequency of the i-th feature within the k-th query
  - nicf_i is the normalized inverse compound frequency of the i-th feature in the collection
  - p_i is the estimated probability at the i-th feature node
Bayesian Inference Network (continued)
- The last layer comprises only the activity-need node (target), or bel(a) when more than one query is used
- Two combination rules: weighted MAX and weighted SUM, where:
  - c_jk is the set of features in common between the j-th compound and the k-th query
  - ql_k is the size of the k-th query
  - p_jk is the estimated probability that the k-th query is met by the j-th compound
  - r is the number of queries
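The two combination rules can be sketched as follows; since the slide's formulas are not reproduced here, the exact weighting (uniform weights below, e.g. in place of query-size factors ql_k) is an assumption:

```python
def weighted_max(p_jk, weights):
    """Weighted-MAX: target belief driven by the best-matching query (sketch)."""
    return max(w * p for w, p in zip(weights, p_jk))

def weighted_sum(p_jk, weights):
    """Weighted-SUM: target belief pools evidence from all r queries (sketch)."""
    return sum(w * p for w, p in zip(weights, p_jk)) / sum(weights)

# p_jk: estimated probabilities that each of r = 3 queries is met by
# compound j (invented numbers); uniform weights for illustration.
p_jk = [0.7, 0.4, 0.9]
weights = [1.0, 1.0, 1.0]
print(weighted_max(p_jk, weights))            # 0.9
print(round(weighted_sum(p_jk, weights), 4))  # 0.6667
```

MAX rewards a compound that closely matches any one reference structure; SUM rewards consistent evidence across all of them, which is why the two rules behave differently on heterogeneous activity classes.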
Experimental Details
- Subset of MDDR with 40,751 molecules
- 12 activity classes; in all, 6,804 actives in the 12 classes
- 10 sets of 10 randomly chosen compounds from each activity class (to form the sets of queries)
- For comparison purposes, similarity calculation is also done using the non-binary Tanimoto coefficient
- Six different types of weighted fingerprints from SciTegic: atom type extended-connectivity counts (ECFC), functional class extended-connectivity counts (FCFC), atom type atom environment counts (EEFC), functional class atom environment counts (FEFC), atom type hashed atom environment counts (EHFC), and functional class hashed atom environment counts (FHFC)
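The non-binary Tanimoto coefficient used as the comparison baseline has the standard continuous form T(x, y) = Σx_i y_i / (Σx_i² + Σy_i² − Σx_i y_i); a minimal sketch on count fingerprints:

```python
def tanimoto_counts(x, y):
    """Continuous (non-binary) Tanimoto on count fingerprints of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

# Invented count vectors standing in for ECFC-style feature counts
print(tanimoto_counts([2, 0, 1, 3], [1, 1, 1, 2]))  # 0.75
```

On 0/1 vectors this reduces to the binary Tanimoto coefficient, so the same baseline covers both fingerprint types.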
MDDR Data

Code | Activity class                      | Actives | Unique AF(a) | Unique MF(b) | Av. mols./AF | Av. mols./MF | Diversity mean | SD
5H3  | 5HT3 antagonists                    | 213     | 133          | 87           | 1.60         | 2.45         | 0.8537         | 0.008
5HA  | 5HT1A agonists                      | 116     | 67           | 54           | 1.73         | 2.15         | 0.8496         | 0.007
D2A  | D2 antagonists                      | 143     | 109          | 75           | 1.31         | 1.91         | 0.8526         | 0.005
Ren  | Renin inhibitors                    | 993     | 542          | 328          | 1.83         | 3.03         | 0.7188         | 0.002
Ang  | Angiotensin II AT1 antagonists      | 1367    | 698          | 396          | 1.96         | 3.45         | 0.7762         | 0.002
Thr  | Thrombin inhibitors                 | 885     | 528          | 335          | 1.68         | 2.64         | 0.8283         | 0.002
SPA  | Substance P antagonists             | 264     | 119          | 78           | 2.22         | 3.38         | 0.8284         | 0.006
HIV  | HIV-1 protease inhibitors           | 715     | 455          | 330          | 1.57         | 2.17         | 0.8048         | 0.004
Cyc  | Cyclooxygenase inhibitors           | 162     | 83           | 44           | 1.95         | 3.68         | 0.8717         | 0.006
Kin  | Tyrosine protein kinase inhibitors  | 453     | 247          | 162          | 1.83         | 2.80         | 0.8699         | 0.006
PAF  | PAF antagonists                     | 716     | 381          | 252          | 1.88         | 2.84         | 0.8669         | 0.004
HMG  | HMG-CoA reductase inhibitors        | 777     | 337          | 168          | 2.31         | 4.63         | 0.8230         | 0.002

(a) Unique AF is the number of unique atomic frameworks present in the class.
(b) Unique MF is the number of unique molecular frameworks present in the class.
Use of a Single Reference Structure
[Figure: search results; the most diverse class is highlighted]
Use of a Single Reference Structure
- Comparison of the average percentage of unique atomic frameworks obtained in the top 5% of the ranked test set using BIN & Tan with EHFC_4
[Figure: the most diverse class is highlighted]
Use of Multiple Reference Structures
- Comparison between BIN & Tan using the MAX rule and ECFC_4
Use of Multiple Reference Structures
- Comparison of the average percentage of atomic frameworks retrieved in the top 5% of the ranked test set using BIN-MAX & Tan-MAX with ECFC_4
BIN with Multiple Molecular Descriptors
- So far we have considered using just a single molecular descriptor and multiple reference structures as the basis for a search
- Further work: searching with multiple molecular descriptors (ECFC4, EHFC4, FHFC4, FPFC4, PHPFC3) with single and multiple reference structures
Use of a Single Molecular Descriptor and a Single Reference Structure
[Network diagram: compound nodes c_1 ... c_m; feature nodes f_1 ... f_n for each descriptor D_1 ... D_s; query nodes q_1 ... q_r for each descriptor; weighted-MAX link matrices wmax_1 ... wmax_s combined by wsum into target node A]
Use of Multiple Molecular Descriptors and a Single Reference Structure
- Comparison between multiple descriptors and a single descriptor with a single reference structure using BIN
Use of Multiple Molecular Descriptors and Multiple Reference Structures
- Comparison between multiple descriptors and a single descriptor with multiple reference structures using BIN
Summary I
- BIN with a single active reference structure outperforms the Tanimoto similarity method in 11 of the 12 classes (improvements between 6% and 71%; 19% overall improvement)
- Only in one activity class (cyclooxygenase inhibitors) is BIN slightly inferior to Tan (-5%)
- BIN with multiple reference structures is superior to Tan in all activity classes (between 5% and 118%), significantly outperforming Tan with a 35% improvement in the overall average recall rate
Summary II
- BIN with multiple descriptors and a single reference structure slightly outperforms BIN with a single descriptor and a single reference structure
- BIN with multiple descriptors and multiple reference structures slightly outperforms BIN with a single descriptor and multiple reference structures
- BIN with multiple descriptors enhances performance substantially when the sought actives are structurally heterogeneous, but only slightly when the sought actives are structurally homogeneous
Summary III
- There is some evidence to suggest that BIN is more effective at scaffold hopping for the more diverse data sets
- The networks do not impose additional computational costs because they do not include cycles
- The major strength is the network's combination of distinct evidential sources to support the ranking of a given compound
- BIN provides the ability to integrate several descriptors and several references into a single framework
Thank you