Kristin P. Bennett. Rensselaer Polytechnic Institute

1 Application in Cheminformatics. Kristin P. Bennett, Mathematical Sciences Department, Rensselaer Polytechnic Institute.

2 Regression Case Study. Given for each molecule $i$: a descriptor vector $x_i$ and a bioresponse $y_i$. Construct a function $f$ to predict the bioresponse, so that $f(x_i) \approx y_i$. The bioresponse is a real-valued measurement. Use SVM regression.

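3 Kernel Regression. Assume the function is linear: $f(x) = x \cdot w + b$. Pick a loss, e.g. $\mathrm{loss}(f(x), y) = (y - f(x))^2$. Candidate losses: least squares, least absolute deviation (LAD), and ε-insensitive. [Figure: the three loss curves, with $-\varepsilon$ and $+\varepsilon$ marked.]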

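4 Support Vector Regression (SVR). Points in the ε-tube are treated as having no error. Robust least absolute deviation is used outside of the tube. ε-insensitive loss function:
$L_\varepsilon(y - f(x)) := \max(0, |y - f(x)| - \varepsilon)$
[Figure: $L_\varepsilon$ plotted against $y - f(x)$, with $-\varepsilon$ and $\varepsilon$ marked on the axis and the slacks $\xi$ and $\xi^*$ labeled.]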
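A minimal sketch of the three losses named on the slides, written in plain Python so the behavior of the ε-tube is easy to check by hand. The function names are illustrative and are not taken from the slides.

```python
def squared_loss(y, f_x):
    """Least-squares loss: (y - f(x))^2."""
    return (y - f_x) ** 2


def lad_loss(y, f_x):
    """Least absolute deviation loss: |y - f(x)|."""
    return abs(y - f_x)


def eps_insensitive_loss(y, f_x, eps):
    """Epsilon-insensitive loss: zero inside the tube, linear outside it."""
    return max(0.0, abs(y - f_x) - eps)


if __name__ == "__main__":
    eps = 0.5
    for residual in (-2.0, -0.5, 0.0, 0.25, 2.0):
        # Treat the residual as y - f(x) with f(x) = 0.
        print(residual,
              squared_loss(residual, 0.0),
              lad_loss(residual, 0.0),
              eps_insensitive_loss(residual, 0.0, eps))
```

Residuals with magnitude at most ε cost nothing, so small measurement noise does not pull on the fit.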

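5 Primal Problem with Regularization.
$\min_{w,b} \; C \sum_i \max(0, |y_i - (x_i \cdot w + b)| - \varepsilon) + \frac{1}{2}\|w\|^2$
Convert to a quadratic program:
$\min_{w,b,\xi,\xi^*} \; C \sum_i (\xi_i + \xi_i^*) + \frac{1}{2}\|w\|^2$
s.t. $y_i - (x_i \cdot w + b) - \xi_i \le \varepsilon$, $\; y_i - (x_i \cdot w + b) + \xi_i^* \ge -\varepsilon$, $\; \xi_i, \xi_i^* \ge 0$ for all $i$.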

6 Construct Dual Problem.
Primal: $\min_r f(r)$ s.t. $g_i(r) \le 0$ for all $i$, where $f: R^n \to R$ and each $g_i: R^n \to R$ are differentiable and convex.
Dual: $\max_{r,\alpha} L(r, \alpha) = f(r) + \sum_i \alpha_i g_i(r)$ s.t. $\nabla_r L(r, \alpha) = \nabla_r f(r) + \sum_i \alpha_i \nabla_r g_i(r) = 0$ and $\alpha_i \ge 0$ for all $i$.
Math magic requiring only plug and chug.

7 Final Regression Problem. The dual SVR with kernel:
$\min_{\alpha,\alpha^*} \; \frac{1}{2} \sum_i \sum_j (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) + \varepsilon \sum_i (\alpha_i + \alpha_i^*) - \sum_i y_i (\alpha_i - \alpha_i^*)$
s.t. $\sum_i (\alpha_i - \alpha_i^*) = 0$, $\; 0 \le \alpha_i, \alpha_i^* \le C$ for all $i$.
Looks nasty, but it is just a standard convex quadratic program.

8 Intuition behind dual and capacity control? Why minimize error + $\|w\|^2$? [Figure: data plotted as y versus x.]

9 Regression Using SVM Classification. [Figure: y versus x, with curves labeled y+ε and y-ε.]

10 Regression using SVM Classification

11 Final Regression Function
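The slide body did not survive extraction. As a hedged sketch, the standard kernel expansion that results from the SVR dual is $f(x) = \sum_i (\alpha_i - \alpha_i^*) K(x_i, x) + b$, and it can be evaluated as below. The RBF form $\exp(-\|x - z\|^2 / (2\rho^2))$ is an assumption about how the kernel parameter ρ enters, and the function names are illustrative.

```python
import math


def rbf_kernel(x, z, rho):
    """Gaussian RBF kernel. The width parameterization by rho is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * rho ** 2))


def svr_predict(x, train_x, alpha, alpha_star, b, rho):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    total = b
    for x_i, a, a_star in zip(train_x, alpha, alpha_star):
        coefficient = a - a_star
        if coefficient != 0.0:  # only support vectors contribute
            total += coefficient * rbf_kernel(x_i, x, rho)
    return total
```

Points strictly inside the ε-tube end up with $\alpha_i = \alpha_i^* = 0$, so only the support vectors need to be stored with the model.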

12 Regularization Shrinks (Soft) Tube (like ν-SVM, Schölkopf et al. 1998). [Figure labels: margin, new tube, original tube, 2ε.]

13 CACO-2 Data. Human intestinal cell line. Predicts drug absorption. 27 molecules with tested permeability. 718 descriptors generated: electronic (TAE), shape/property (PEST), traditional (MOE).

14 Electron Density-Derived TAE-Wavelet Descriptors. 1) Surface properties are encoded on the e/au^3 surface (Breneman, C.M. and Rhem, M., J. Comp. Chem., 1997, 18(2)). 2) Histograms or wavelet encodings of surface properties give TAE property descriptors. [Figure panels: histograms; PIP (local ionization potential); wavelet coefficients.]

15 PEST-Shape Descriptors: Surface Property-Encoded Ray Tracing. [Figure panels: TAE internal ray reflection, low-resolution scan; isosurface (portion removed) with 750 segments.]

16 Shape-Aware Molecular Descriptors from Property/Segment-Length Distributions. Segment length and point-of-incidence value form a 2D histogram. Each bin of the 2D histogram becomes a hybrid descriptor. 36 descriptors per hybrid length-property. [Figure: PIP vs. segment length.]

17 Benzodiazepine structure, TAE surface reconstruction, and PEST shape/property signatures. [Figure: structure drawing with atom labels O, N, N, Cl.]

18 Practical Issues. Overfitting/lack of data. Feature selection. Difficult validation. Model/parameter selection. Very high model variance. No confidence in any one model. Robust SVM methodology: bagged feature selection via sparse linear SVM; bagged RBF SVM for the final model; model selection via pattern search; model mining for more information.

19 SVM Methodology. [Flowchart, labels partly cut off: Constru[...]; Select [...]; Select parameters C, ε, ρ; Optimiz[...] model; Bag mode[...]; Final model.]

20 Model Selection. To choose the SVM model parameters (objective: C; tube: ε; RBF kernel: ρ): Select an evaluation function: $Q^2 = \text{(mean square error)} / \text{(true variance)}$. Evaluate on out-of-sample data: a validation set or leave-one-out. Optimize using grid search or pattern search.
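A small sketch of the evaluation function, assuming that "true variance" means the variance of the observed responses in the out-of-sample set (an interpretation, not stated on the slide). Lower is better: 0 is a perfect fit, and 1 is no better than always predicting the mean response.

```python
def q_squared(y_true, y_pred):
    """Q^2 = mean squared error divided by the variance of the observed values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    variance = sum((t - mean_y) ** 2 for t in y_true) / n
    return mse / variance


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0]
    print(q_squared(observed, [1.0, 2.0, 3.0]))  # perfect predictions
    print(q_squared(observed, [2.0, 2.0, 2.0]))  # always predicting the mean
```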

21 Pattern or Direct Search.
Repeat
  Evaluate neighbors in grid
  If better neighbor then go to neighbor
  Else reduce grid size
Until grid size is small enough
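The loop above can be sketched in a few lines. This is one simple variant (move to the best axis-aligned neighbor, otherwise halve the grid), not necessarily the exact search used for the reported results. The toy objective stands in for an out-of-sample error such as Q2 evaluated over SVM parameters. All names and defaults are illustrative.

```python
def pattern_search(objective, start, step=1.0, min_step=1e-3, shrink=0.5,
                   max_iter=10000):
    """Minimize objective by moving to better grid neighbors, shrinking the grid when stuck."""
    point = list(start)
    best = objective(point)
    for _ in range(max_iter):
        if step < min_step:  # grid size is small enough
            break
        # Evaluate the neighbors one step away along each coordinate.
        candidates = []
        for i in range(len(point)):
            for delta in (step, -step):
                neighbor = list(point)
                neighbor[i] += delta
                candidates.append((objective(neighbor), neighbor))
        value, neighbor = min(candidates, key=lambda pair: pair[0])
        if value < best:
            point, best = neighbor, value  # go to the better neighbor
        else:
            step *= shrink  # no better neighbor: reduce the grid size
    return point, best


def toy_objective(p):
    """Hypothetical stand-in for a validation error surface."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2


if __name__ == "__main__":
    print(pattern_search(toy_objective, [0.0, 0.0]))
```

Because the search only ever accepts improvements, it stops at a local minimum, which is one reason different starting points or validation sets can yield different models.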

22 Boosting and Bagging. Problems: Out-of-sample results don't guarantee good generalization. Different validation sets give different models. Many local minima in pattern search. Solution = bagging: create several models and average the results.

23 Bagged SVM (RBF), CACO2, 718 Variables. [Scatter plot: predicted RT (min) versus observed RT (min), annotated with the test Q2.]

24 Feature Selection. Using a subset of descriptors can greatly improve results. Use your favorite selection method. Linear SVM with 1-norm regularization.

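25 1-norm is sparse. $\|(1, 0)\|_1 = \|(1/2, 1/2)\|_1 = 1$, but $\|(1/2, 1/2)\|_2 = 1/\sqrt{2} < \|(1, 0)\|_2 = 1$. The 2-norm rewards spreading weight over many descriptors, while the 1-norm does not penalize putting all the weight on one descriptor.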
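The comparison can be checked numerically. This is a minimal sketch using the two weight vectors from the slide; the helper names are illustrative.

```python
import math


def norm1(w):
    """1-norm: sum of absolute weights."""
    return sum(abs(v) for v in w)


def norm2(w):
    """2-norm: Euclidean length."""
    return math.sqrt(sum(v * v for v in w))


sparse_w = (1.0, 0.0)  # all weight on one descriptor
dense_w = (0.5, 0.5)   # weight spread over two descriptors

if __name__ == "__main__":
    print(norm1(sparse_w), norm1(dense_w))
    print(norm2(sparse_w), norm2(dense_w))
```

Under a 2-norm penalty the dense vector is strictly cheaper, so the optimizer prefers it. Under the 1-norm the two cost the same, and linear programs tend to return vertex solutions with many zero weights.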

26 Feature Selection via Sparse SVM/LP. Construct a linear ν-SVM using the 1-norm LP:
$\min_{w,b,\varepsilon,z,z^*} \; C \sum_i (z_i + z_i^*) + C \nu \varepsilon + \|w\|_1$
s.t. $x_i \cdot w + b - y_i + z_i \ge -\varepsilon$, $\; x_i \cdot w + b - y_i - z_i^* \le \varepsilon$, $\; z_i, z_i^*, \varepsilon \ge 0$ for all $i$.
Pick the best C, ν for the SVM. Keep descriptors with nonzero coefficients, $|w_i| > 0$.

27 Bagged Feature Selection. [Flowchart nodes: Partition Training Data; Training Set; Validation Set; Linear SVM Algorithm for Feature Selection; Random Variable r; Repeat B times; A Linear Regression Model; Bag B Models and Obtain Subset of Features.] Make 20 models of the form $w_1 x_1 + w_2 x_2 + \dots + w_r r + b$ with only a few nonzero $w_i$.

28 Bagged SVM (RBF), CACO2, 31 Variables. [Scatter plot: predicted RT (min) versus observed RT (min), annotated with the test Q2.]

29 Model Mining Generate many equally valid models. Models are data. Mine the model data for trends. Visualize models for chemist: chemist can interact with modeling Generate hypotheses from model data: descriptor rankings and interpretations

30 Star Plot of ABSDRN6. ABSDRN6 is the most weighted descriptor on average. It relates to molecule size. Negatively weighted. Each radius represents the weight in one bootstrap; its length is the magnitude of the weight. INTERPRETATION: Large molecules do not absorb well.

31 Starplot, Caco2, 31 Variables: ABSDRN6, DRNB10, DRNB00, PIPB04, PEOE.VSA.FHYD, PEOE.VSA.FNEG, SlogP.VSA0, a.don, KB11, PEOE.VSA.4, PEOE.VSA.FPOL, PEOE.VSA.PPOS, BNPB31, KB54, PEOE.VSA.FPPOS, SlogP.VSA6, PIPMAX, EP2, FUKB14, SMR.VSA2, ANGLEB45, apol, BNPB50, SlogP.VSA9, pmiz, BNP8, PIPB53, ABSFUKMIN, BNPB21, ABSKMIN, SIKIA.

32 Chemistry In/Out Modeling. [Flowchart nodes: Data + Descriptors; Feature Selection; Visualize Features; Assess Chemistry; Test Data; SVM Model; Chemistry Interpretation; Construct SVM Nonlinear Model; Predict Bioactivities.]

33 The Flipped Rule. To investigate the relative importance of selected descriptors and their consistency: if a descriptor has $w > 0$ in some models and $w < 0$ in others, it doesn't make sense. So eliminate flipped variables.
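A sketch of the flipped rule applied to the weight vectors of several bagged linear models. It assumes that zero weights are ignored when checking signs and that a descriptor never selected by any model is dropped; neither detail is stated on the slide. The descriptor names and weights in the demo are hypothetical.

```python
def consistent_features(weights_by_model, names):
    """Keep descriptors whose nonzero weights have the same sign in every model."""
    kept = []
    for j, name in enumerate(names):
        nonzero = [w[j] for w in weights_by_model if w[j] != 0]
        if not nonzero:
            continue  # never selected by any model
        all_positive = all(v > 0 for v in nonzero)
        all_negative = all(v < 0 for v in nonzero)
        if all_positive or all_negative:
            kept.append(name)  # sign is consistent, so not flipped
    return kept


if __name__ == "__main__":
    # Hypothetical weights: one row per bootstrap model, one column per descriptor.
    demo_weights = [
        [0.5, -0.2, 0.0],
        [0.4, 0.1, 0.0],
        [0.6, -0.3, 0.0],
    ]
    print(consistent_features(demo_weights, ["d1", "d2", "d3"]))
```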

34 Bagged SVM (RBF), CACO2, 15 Variables. [Scatter plot: predicted RT (min) versus observed RT (min), annotated with the test Q2.]

35 Visualization of feature selection results To investigate the relative importance of selected descriptors and their consistency

36 CACO2, 15 Variables: a.don, DRNB10, PEOE.VSA.FNEG, BNPB31, KB54, ABSDRN6, ABSKMIN, FUKB14, SMR.VSA2, PEOE.VSA.FPPOS, SIKIA, SlogP.VSA0, ANGLEB45, DRNB00, pmiz.

37 Star Plot of a.don. a.don is the most weighted variable. Measures the number of hydrogen-bond donors. Negatively weighted. Each radius represents the weight in one bootstrap; its length is the magnitude of the weight. INTERPRETATION: Molecules with many hydrogen bonds bind well with [...], so they will stay in solution instead of absorbing.

38 Star Plot of SlogP.VSA0. SlogP.VSA0 is the 2nd most weighted variable. Reflects hydrophobicity of the molecule. Positively weighted. INTERPRETATION: Hydrophobic molecules absorb more easily.

39 Chemical Insights. Hydrophobicity: a.don. Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45, pmiz. Large is bad. Flat is bad. Globular is good. Polarity: PEOE.VSA.FPPOS, PEOE.VSA.FNEG; negative partial charge is good. These correspond to the conventional-wisdom rule of 5.

40 Hybrid TAE/SHAPE. Shape is an important overall factor. DRNB10, DRNB00: del rho dot N. BNP31: bare nuclear potential. KB54: kinetic energy descriptors; very large lipophilic molecules don't work. FUKB14: Fukui surface. Interpretations are difficult. They point to chemistry challenges/hypotheses.

41 Final SVM Approach. Construct a large set of descriptors. Perform feature selection: sensitivity analysis or SVM-LP. Construct many SVM models: optimize using QP or LP; evaluate by validation set or leave-one-out; select the best models by grid or pattern search. Bag the best 9 models to create the final function.

42 Drug Discovery Results (LOO). Table columns: Data; # Samples; # Var. (Full); # Var. FS (Avg); Q2 (Full); Q2 (FS). Table rows (datasets): Caco, Barrier, HIV, Cancer, LCCK, Aquasol.

43 Conclusions. Defined a robust modeling methodology for QSAR-type problems. Generates many valid models. Mine models for additional information. Model visualization allows chemistry in/out. Can substitute your favorite feature selection/inference methodology. Generalizable to many inference/modeling tasks.

44 Bagged Predictive Model. To achieve better generalization performance: construct a series of nonlinear SVM models, then use the average of all models as the final prediction to reduce variance.
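A minimal sketch of bagging by bootstrap resampling and averaging. To stay self-contained it uses a one-dimensional least-squares line as a stand-in base learner, which is NOT the nonlinear SVM from the slides; any function that takes training data and returns a predictor could be passed as `fit`. All names and defaults are illustrative.

```python
import random


def fit_line(xs, ys):
    """Stand-in base learner (not an SVM): least-squares line through the sample."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return lambda x: mean_y  # degenerate sample: predict the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def bagged_predictor(xs, ys, fit, n_models=10, seed=0):
    """Train n_models on bootstrap samples and return their averaged predictor."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw n training points with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [2.0 * x + 1.0 for x in xs]
    predict = bagged_predictor(xs, ys, fit_line)
    print(predict(4.0))
```

Averaging leaves the bias of the base learner roughly unchanged but reduces the variance that comes from any single training/validation split.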

45 Bagged SVM (RBF), CACO2, 718 Variables, Average of 10 Models. Test Q2 = .7073 (Q2 is MSE scaled by variance). [Scatter plot: predicted RT (min) versus observed RT (min).]

46 Feature Selection. Using a subset of descriptors can greatly improve results. Do feature selection using a linear SVM with 1-norm regularization.

47 Feature Selection via Sparse SVM/LP (Bi et al. 2003). Construct a linear ν-SVM using the 1-norm LP:
$\min_{w,b,\varepsilon,z,z^*} \; C \sum_i (z_i + z_i^*) + C \nu \varepsilon + \|w\|_1$
s.t. $x_i \cdot w + b - y_i + z_i \ge -\varepsilon$, $\; x_i \cdot w + b - y_i - z_i^* \le \varepsilon$, $\; z_i, z_i^*, \varepsilon \ge 0$ for all $i$.
Pick the best C, ν for the SVM. Keep descriptors with nonzero coefficients, $|w_i| > 0$.

48 Bagged Variable Selection. [Flowchart nodes: Partition Training Data; Training Set; Validation Set; Linear SVM Algorithm for Feature Selection; Random Variable r; Repeat B times; A Linear Regression Model; Bag B Models and Obtain Subset of Features.] Make 20 models of the form $w_1 x_1 + w_2 x_2 + \dots + w_r r + b$ with only a few nonzero $w_i$. Keep attributes whose weight $w_i$ is larger in magnitude than the weight $w_r$ of the random variable.
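One possible reading of the selection step, sketched under stated assumptions: each bagged sparse linear model supplies a weight vector that includes a weight for the injected random variable r, and a descriptor is kept when its average absolute weight exceeds that of r. The averaging and the use of absolute values are assumptions, since the exact rule is not legible on the slide. Fitting the sparse linear SVM itself (an LP) is omitted, and the names and weights in the demo are hypothetical.

```python
def select_features(weights_by_model, names, random_index):
    """Keep descriptors whose mean |weight| beats that of the random variable."""
    n_models = len(weights_by_model)

    def mean_abs_weight(j):
        return sum(abs(w[j]) for w in weights_by_model) / n_models

    threshold = mean_abs_weight(random_index)  # importance of pure noise
    return [name for j, name in enumerate(names)
            if j != random_index and mean_abs_weight(j) > threshold]


if __name__ == "__main__":
    # Hypothetical weights: columns are d1, d2, d3, and the random variable r.
    demo_weights = [
        [0.9, 0.05, 0.0, 0.1],
        [0.7, 0.0, 0.5, 0.2],
    ]
    print(select_features(demo_weights, ["d1", "d2", "d3", "r"], random_index=3))
```

A descriptor that cannot outweigh a variable known to be noise is unlikely to carry signal, which makes r a data-driven cutoff.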

49 Bagged Variable Selection. [Flowchart nodes: Random Variables; DATASET; Training set; Test set; Bootstrap sample k; Training; Sparse Linear SVM; descriptors; Validation; Tuning / Prediction; Reduced Data; Predictive Model; Nonlinear SVM; Prediction.]

50 Star Plot of a.don. Measures the number of hydrogen-bond donors. Negatively weighted. INTERPRETATION: Molecules with many hydrogen bonds bind well with [...], so they will stay in solution instead of absorbing.

51 Caco-2, 14 Features (SVM). Each star represents a descriptor. Each ray is a separate bootstrap. The area of a star represents the relative importance of that descriptor. Descriptors shaded cyan have a negative effect. Unshaded ones have a positive effect. Descriptors shown: a.don, DRNB10, PEOE.VSA.FNEG, BNPB31, KB54, ABSDRN6, ABSKMIN, FUKB14, SMR.VSA2, PEOE.VSA.FPPOS, SIKIA, SlogP.VSA0, ANGLEB45, DRNB00. Hydrophobicity: a.don. Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45. Large is bad. Flat is bad. Globular is good. Polarity: PEOE.VSA...; negative partial charge is good.

52 Bagged SVM (RBF), Caco. Train $R^2_{cv} = 0.93$. Blind test $R^2$ = [...]. Before feature selection: $R^2 = .66$.
