Project topics for the course Special Course in Bioinformatics II: Machine Learning in Bioinformatics


Eric Bach, Céline Brouard, Anna Cichonska, Markus Heinonen, Huibin Shen, Juho Rousu
March 27,

1 Retention time prediction using kernel methods
Eric Bach (eric.bach@aalto.fi)

Background: In untargeted metabolomics studies, complex biological samples containing possibly thousands of molecules are encountered. Tandem mass spectrometry (MS/MS) is a widely used technique for extracting patterns from biological samples in order to identify the molecules in them. However, the sensitivity of a mass spectrometer depends on the ability to reduce the complexity of the biological sample, e.g. to prevent MS/MS spectra from representing more than one molecule. Liquid chromatography (LC) is a technique for such complexity reduction. When a properly prepared biological sample is passed through an LC column, the molecules in the sample interact differently with the column's stationary phase. As a result, the molecules separate over time according to their molecular properties: some pass through the column faster than others. The time at which a molecule leaves the column is called its retention time. The retention time can serve as orthogonal information for metabolite identification, e.g. it can exclude molecular candidates that are expected to have a different retention time [Aic+15] or make it possible to distinguish diastereoisomers [SNV15]. Unfortunately, retention time measurements are available only for a small number of molecules and are not comparable between different chromatographic systems. On the other hand, the set of molecular candidates for the identification of a single molecule (given its MS/MS spectrum) can contain thousands of molecules. Therefore, machine learning algorithms have been applied to predict retention times from the structure of a molecule [Aic+15; Fal+16].

Goal: In this project the student will implement and apply two different kernelized regression approaches to predict the retention time of molecules from their structure.

Methods and materials: The student will be provided with a data set containing retention time measurements for 596 molecules, together with the corresponding molecular descriptors and fingerprints. The student will implement Kernel Ridge Regression (KRR) and Magnitude-Preserving Kernel Regression (MPKR) [CMR07], apply both approaches to predict the retention times for the molecular structures in the data set, compare their performance, and investigate whether the magnitude-preserving error term leads to better retention time prediction.

Prerequisite: Basic knowledge of machine learning (especially kernel methods) and parameter estimation (e.g. cross-validation), linear algebra, and programming skills in R, MATLAB or Python. Some basic knowledge of molecular biology and chemoinformatics is beneficial.

[Aic+15] Fabian Aicheler et al. Retention Time Prediction Improves Identification in Nontargeted Lipidomics Approaches. In: Analytical Chemistry (2015), pp
[CMR07] Corinna Cortes et al. Magnitude-preserving Ranking Algorithms. In: Proceedings of the 24th International Conference on Machine Learning. ICML '07. ACM, url:
[Fal+16] Federico Falchi et al. Kernel-Based, Partial Least Squares Quantitative Structure-Retention Relationship Model for UPLC Retention Time Prediction: A Useful Tool for Metabolite Identification. In: Analytical Chemistry (2016).
[SNV15] Jan Stanstrup et al. PredRet: Prediction of Retention Time by Direct Mapping between Multiple Chromatographic Systems. In: Analytical Chemistry (2015). PMID: , pp url:
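
As a rough illustration of the KRR approach named in the Methods, the following Python sketch fits kernel ridge regression in its dual form, alpha = (K + lambda I)^{-1} y, and predicts with the test-versus-training kernel matrix. The Gaussian kernel, its width gamma, the value of lambda and the synthetic data are all assumptions made for this example; the project's descriptors, fingerprints and kernels are not reproduced here, and MPKR is not covered.

```python
import numpy as np


def gaussian_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def krr_fit(K_train, y_train, lam):
    """Dual KRR coefficients: solve (K + lam * I) alpha = y."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), y_train)


def krr_predict(K_test_train, alpha):
    """Predictions are kernel-weighted sums over the training examples."""
    return K_test_train @ alpha


# Synthetic stand-in data (assumption): 60 "molecules" with 5 descriptors each.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)  # made-up "retention times"

train, test = np.arange(40), np.arange(40, 60)
lam = 0.1  # regularisation strength; in practice tuned by cross-validation

K_train = gaussian_kernel(X[train], X[train])
alpha = krr_fit(K_train, y[train], lam)
y_hat = krr_predict(gaussian_kernel(X[test], X[train]), alpha)
rmse = float(np.sqrt(np.mean((y_hat - y[test]) ** 2)))
print(f"test RMSE: {rmse:.3f}")
```

Working with precomputed kernel matrices keeps the regression code independent of how molecules are represented, so a fingerprint-based kernel could be swapped in without changing `krr_fit` or `krr_predict`.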

2 Metabolite identification from tandem mass spectra
Céline Brouard

Background: Metabolites are small molecules involved in the biological processes of organisms. Metabolite identification, an important problem in molecular biology, consists of determining the molecular structures of the unknown metabolites present in a biological sample. Information on these unknown metabolites can be obtained using tandem mass spectrometry. Recent progress in metabolite identification has come from machine learning-based methods.

Goal: The goal of this project is to implement the CSI:FingerID method described in the lecture and to apply it to the dataset used in the last CASMI (Critical Assessment of Small Molecule Identification) contest. The idea of this contest is to evaluate different metabolite identification methods on a common dataset. A set of training examples is provided, and for each given tandem mass spectrum the correct molecular structure has to be determined among a set of candidate molecular structures.

Materials and Methods: In this project, the student will implement the CSI:FingerID method. During the learning phase, the training MS/MS spectra are used to train a set of Support Vector Machine (SVM) classifiers to predict molecular properties. The SVM parameter C will be tuned using k-fold cross-validation on the training set, independently for each molecular property. In the prediction phase, the fingerprints of the unknown metabolites are predicted from their MS/MS spectra. The predicted fingerprints are then compared with the fingerprints of the candidate molecular structures to find the best match. The training dataset contains 234 tandem mass spectra and the challenge dataset consists of 127 tandem mass spectra. A list of candidates is provided for each challenge spectrum. For each molecule, fingerprints have been retrieved from PubChem and OpenBabel. Kernels on tandem mass spectra will be provided as input.
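
The matching step of the prediction phase can be illustrated with a small Python sketch. It ranks candidate structures by the Tanimoto similarity between a predicted binary fingerprint and each candidate's fingerprint. Tanimoto similarity is an assumption chosen here as a simple stand-in, since the text only says the fingerprints are compared for a best match; the function names and the toy fingerprints are hypothetical, and the SVM training step is not shown.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints: |a AND b| / |a OR b|."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention (assumption): two empty fingerprints are identical
    return float(np.logical_and(a, b).sum() / union)


def rank_candidates(predicted_fp, candidate_fps):
    """Return candidate indices sorted from best to worst match, plus the scores."""
    scores = [tanimoto(predicted_fp, fp) for fp in candidate_fps]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy example (assumption): a 6-bit predicted fingerprint and three candidates.
predicted = [1, 0, 1, 1, 0, 0]
candidates = [
    [1, 0, 1, 0, 0, 0],  # shares 2 of the 3 bits in the union
    [1, 0, 1, 1, 0, 0],  # identical to the prediction
    [0, 1, 0, 0, 1, 1],  # no bits in common
]
order, scores = rank_candidates(predicted, candidates)
print(order, scores)
```

In the toy example the identical candidate is ranked first and the disjoint one last.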
Required background knowledge/skills: Programming skills (preferably MATLAB or R), basic knowledge of machine learning, and an understanding of the basic principles of support vector machines. Some knowledge of molecular biology will be beneficial.

[1] Heinonen, M., Shen, H., Zamboni, N., and Rousu, J. (2012). Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics, 28(18):
[2] Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014). Metabolite identification through multiple kernel learning on fragmentation trees. Bioinformatics, 30(12):i157-i164.
[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences, 112(41):

3 Multiple kernel learning for drug-protein interaction prediction
Anna Cichonska (anna.cichonska@aalto.fi)

Background: Drug-like chemical compounds act mainly by modulating cellular targets, such as proteins. Experimental determination of interactions between chemical compounds and protein targets is time-consuming and expensive. In recent years, therefore, a lot of effort has gone into developing computational methods that could provide fast, large-scale and systematic pre-screening of chemical probes. In particular, much work has been devoted to compound-based interaction prediction methods, including quantitative structure-activity relationship (QSAR) models, which aim to relate structural properties of chemical molecules to their bioactivity profiles. Another class of computational methods, so-called target-based methods, focuses on evaluating similarities between the amino acid sequences or three-dimensional structures of protein targets. In these supervised learning approaches, models are trained using available bioactivity data together with either compound or protein information, which then allows predicting either new targets of a given drug or new drugs targeting a given protein. As a more recent class of computational modelling approaches, systems-based frameworks take advantage of the information available on both compounds and proteins.
A key assumption is that similar drug compounds interact with similar proteins. A proper representation and use of similarities, equivalent to a kernel choice, is therefore a first critical prerequisite for high-quality drug-protein interaction (DPI) predictions. Classical kernel-based methods rely on a single kernel. However, such approaches are unlikely to be optimal as a growing variety of biological and molecular data sources becomes available. Multiple kernel learning (MKL) methods search for an optimal combination of several kernels, which makes it possible to use different information sources simultaneously and to learn their importance for the prediction task; they are therefore receiving increasing attention. Typically, a binary-valued DPI prediction setup is employed. However, molecular interactions are not simple on-off relationships, and predicting real-valued binding affinities is more appealing.

Goal: The goal of the project is to compute several protein kernels as well as drug kernels, and then use them in an MKL regression framework to predict drug-protein binding affinities.

Materials and Methods: The data set consists of 50 drug compounds and 50 protein targets, a subset of the data from the experimental study of Metz et al. (2011). DPIs are represented as real values reflecting how tightly a compound binds to a protein. The student will calculate Tanimoto kernels for drug compounds based on several fingerprints implemented in the ChemmineR R package. For proteins, Smith-Waterman amino acid sequence alignment as well as the Generic String kernel will be adopted. The student can also choose to compute other molecular descriptors. Then, pairwise kernels that directly relate drug-protein pairs will be constructed by taking the Kronecker product of each pair of drug kernel and protein kernel. The student will use the pairwise kernels with the two-stage MKL algorithm ALIGNF. In the first stage, kernel mixture weights are determined by maximising the centred alignment, a matrix similarity measure, between the combined kernel and the ideal, so-called target kernel derived from the label values. In the second stage, the combined kernel is used with Kernel Ridge Regression (KRR) as the prediction model. The student will be provided with a script for calculating the kernel mixture weights (first stage) but should implement KRR (second stage). The UNIMKL algorithm, in which all kernel mixture weights are equal to 1/P, with P being the number of input kernels, will form the baseline model. The student will implement nested cross-validation to tune the regularisation parameter λ of KRR and assess the predictive performance of the model.

Prerequisite: Programming skills (MATLAB, R, Python), basic knowledge of machine learning. Some knowledge of chemoinformatics will be beneficial.

[1] Ding H, Takigawa I, Mamitsuka H, Zhu S. Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Briefings in Bioinformatics 2014; 15(5):
[2] Cichonska A, Rousu J, Aittokallio T. Identification of drug candidates and repurposing opportunities through compound-target interaction networks. Expert Opinion on Drug Discovery 2015; 10(12):
[3] Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 2008; 24(13): i
[4] Giguere S, Marchand M, Laviolette F, Drouin A, Corbeil J. Learning a peptide-protein binding affinity predictor with kernel ridge regression. BMC Bioinformatics 2013; 14(1): 82.
[5] Cortes C, Mohri M, Rostamizadeh A. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 2012; 13(Mar):
[6] Metz JT, Johnson EF, Soni NB et al. Navigating the kinome. Nature Chemical Biology 2011; 7(4):
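
The kernel constructions described in the Materials and Methods of this project can be sketched in Python as follows. The sketch builds a Tanimoto kernel from binary fingerprints, forms a pairwise kernel with a Kronecker product, computes the centred alignment with a target kernel y y^T, and forms the uniform (1/P) combination used by the baseline. The random fingerprints, the linear kernel standing in for a protein sequence kernel, the labels and all function names are assumptions made for illustration; the ALIGNF weight optimisation itself is not shown.

```python
import numpy as np


def tanimoto_kernel(F):
    """Tanimoto kernel between the rows of a binary fingerprint matrix F."""
    F = np.asarray(F, dtype=float)
    inter = F @ F.T                               # shared "on" bits
    counts = F.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    safe_union = np.where(union > 0, union, 1.0)  # avoid division by zero
    return np.where(union > 0, inter / safe_union, 1.0)


def center(K):
    """Centre a kernel matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def centered_alignment(K1, K2):
    """Frobenius inner product of the centred kernels, normalised by their norms."""
    K1c, K2c = center(K1), center(K2)
    return float(np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c)))


rng = np.random.default_rng(0)
n_drugs, n_prots = 4, 3
drug_fp = rng.integers(0, 2, size=(n_drugs, 16))  # made-up fingerprints
prot_feat = rng.normal(size=(n_prots, 5))         # made-up protein features

K_drug = tanimoto_kernel(drug_fp)
K_prot = prot_feat @ prot_feat.T  # linear kernel as a stand-in (assumption)

# Pairwise kernel: the pair (drug i, protein k) sits at index i * n_prots + k.
K_pair = np.kron(K_drug, K_prot)

# Target kernel from real-valued labels, ordered like the pairs above.
y = rng.normal(size=n_drugs * n_prots)
K_target = np.outer(y, y)
alignment = centered_alignment(K_pair, K_target)

# Uniform baseline: every one of the P input kernels gets weight 1/P.
kernels = [K_pair, np.kron(K_drug, np.eye(n_prots))]
K_uniform = sum(kernels) / len(kernels)
print(K_pair.shape, round(alignment, 3))
```

Note that the label vector must follow the same pair ordering as the Kronecker product, otherwise the target kernel and the pairwise kernels describe different pairs.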

4 Differential gene expression analysis
Markus Heinonen

Background: In differential gene expression analysis, statistical methods are applied to find which genes are over- or under-expressed with respect to control baseline expression levels. These results are subsequently analysed for biological significance by inspecting the functional annotations of these genes to gain insight into the cellular processes of interest. In static differential testing, expression matrices are compared using well-defined statistics. In dynamic differential testing, time series or interpolation models over time are compared using frequentist or Bayesian statistics. In both cases a large-scale view of the expression patterns of thousands of genes emerges.

The key question in differential analysis is the choice of model for the expression patterns. Genes commonly exhibit non-stationarity, where the underlying dynamics can change abruptly through perturbation or regulation. The sparse and often irregularly sampled data warrant careful modelling of the signals. Typically, Gaussian processes (GPs) are used as the underlying model family for interpolation and data representation. Differential expression can be tested against a constant level, between two conditions, or between multiple conditions.

Goal: The goal of the project is to model gene expression time series with Gaussian processes and apply differential testing to find differentially regulated genes between conditions.

Materials and methods: In this project the response of Arabidopsis plant gene expression to Botrytis infection is analysed. The gene expression time series are modelled using Gaussian processes, and two-sample interval testing is carried out to find which genes are differentially expressed in the infection response and when they are differentially expressed. The analysis results in a temporal cascade of differential gene expression.
The plant gene expression measurements are large-scale and of high quality, with numerous biological and technical replicates. The data is located in the GEO database. The dataset consists of 24 time points (2, 4, ..., 48 hours) for infected and normal plant cells, with 4 biological replicates (plants) and 3 technical replicates, for almost 10,000 gene probes. The GP modelling can be performed with any GP implementation (e.g. GPML/GPstuff in MATLAB, gptk/GPfit in R, pygp/GPy in Python). A suitable learning criterion, such as marginal likelihood or cross-validation, should be used. An appropriate kernel for the GP prior should be chosen as well, with the Gaussian kernel being a common choice. Two-sample testing should be implemented according to the Bayesian EMLL framework (see slides).

Finally, the differentially expressed genes can be studied in many ways. These include visualisation over time, clustering of their expression patterns, and consideration of their functional classifications (such as GO terms, KEGG pathways, InterPro families or the PANTHER functional classification), which are found in several databases, for instance the DAVID and BioGPS web servers.

Optionally, the student can experiment with non-stationary GPs, where the observation noise or signal variance is time-dependent. The GPstuff package contains an implementation of non-stationary GPs. The goal is to analyse which gene expression time series warrant a non-stationary GP, and to analyse the model improvement and runtime effects of adding non-stationarity [see Tolvanen et al. 2014]. More detailed instructions will be available from the instructor.

Required background knowledge/skills: Programming skills (MATLAB, R, Python), basic statistics, basic Bayesian statistics and machine learning. Some knowledge of biology will be useful.

Heinonen et al. (2015): Detecting time periods of differential gene expression using Gaussian processes: An application to endothelial cells exposed to radiotherapy dose fraction. Bioinformatics, 31:
Rasmussen & Williams (2006): Gaussian processes for machine learning [sections 2, 4.2 and 5.4].
Windram et al. (2012): Arabidopsis Defense against Botrytis cinerea: Chronology and Regulation Deciphered by High-Resolution Temporal Transcriptomic Analysis. The Plant Cell, 24:
Stegle et al. (2010): A Robust Bayesian Two-Sample Test for Detecting Intervals of Differential Gene Expression in Microarray Time Series. Journal of Computational Biology, 17:
Tolvanen et al. (2014): Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In IEEE MLSP.
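
To make the GP modelling and two-sample idea of this project concrete, the Python sketch below computes the GP log marginal likelihood under a Gaussian kernel and compares two explanations of a pair of time series: one shared GP for both conditions versus an independent GP per condition. This is a deliberately simplified whole-series comparison, not the Bayesian EMLL framework from the slides and not an interval test. The hyperparameters are fixed by hand rather than learned, and the synthetic profiles and all function names are assumptions.

```python
import numpy as np


def gaussian_kernel(t1, t2, ell, sf2):
    """Gaussian (squared exponential) covariance between two sets of time points."""
    d = t1[:, None] - t2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)


def log_marginal_likelihood(t, y, ell=8.0, sf2=1.0, sn2=0.1):
    """log p(y | t) for a zero-mean GP with Gaussian noise of variance sn2."""
    K = gaussian_kernel(t, t, ell, sf2) + sn2 * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha
                 - np.log(np.diag(L)).sum()
                 - 0.5 * len(t) * np.log(2.0 * np.pi))


def log_bayes_factor(t, y_a, y_b, **hyp):
    """Independent-GP evidence minus shared-GP evidence (larger: more different)."""
    independent = (log_marginal_likelihood(t, y_a, **hyp)
                   + log_marginal_likelihood(t, y_b, **hyp))
    shared = log_marginal_likelihood(np.concatenate([t, t]),
                                     np.concatenate([y_a, y_b]), **hyp)
    return independent - shared


# Synthetic example (assumption): measurements every 2 hours up to 48 hours.
t = np.arange(2.0, 50.0, 2.0)
y_ctrl = np.sin(t / 8.0)   # made-up control profile
y_same = y_ctrl.copy()     # a "gene" that does not respond
y_shift = y_ctrl + 2.0     # a "gene" with a clear expression shift

lbf_same = log_bayes_factor(t, y_ctrl, y_same)
lbf_shift = log_bayes_factor(t, y_ctrl, y_shift)
print(lbf_same, lbf_shift)
```

With identical profiles the shared model is favoured (negative log Bayes factor), while the shifted profile favours independent models. A real analysis would optimise the hyperparameters per gene and handle replicates explicitly.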

5 Learning molecular representation with an autoencoder
Huibin Shen

Background: Current representations of molecules include binary vector representations such as molecular fingerprints, string representations such as InChI or SMILES, and 2D/3D graphs. Many applications involving molecules are built on some such representation. Learning better representations of data is at the core of deep learning. Present-day compound databases contain millions of molecules, and with deep learning it is possible to learn a compact and continuous vector representation from them.

Goal: In this project, we will use a variational autoencoder to learn such a representation and test it in a metabolite identification pipeline. We will first test the autoencoder on a subset of 5M molecules with a fingerprint representation or a SMILES string representation. The code and data are already available. The student will run the code on GPU nodes on Triton.

Prerequisite: Python and basic knowledge of machine learning and deep learning.

[1] Gómez-Bombarelli, R., Duvenaud, D., Hernández-Lobato, J. M., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. (2016). Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:
[2] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:
[3] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:
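
For orientation, the objective usually maximised when training a variational autoencoder is the evidence lower bound, written below in standard notation. The symbols are assumptions introduced here and do not come from the project description: x is a molecule representation (for example a fingerprint or SMILES string), z is the continuous latent vector, q_phi is the encoder, p_theta is the decoder and p(z) is the prior. The provided project code may use a different variant.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - \mathrm{KL}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

The first term rewards accurate reconstruction of the molecule from its latent vector, and the second keeps the encoder's distribution close to the prior, which is what makes the learned representation compact and continuous.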
