Project topics for the course Special Course in Bioinformatics II: Machine Learning in Bioinformatics
Eric Bach, Céline Brouard, Anna Cichonska, Markus Heinonen, Huibin Shen, Juho Rousu
March 27

1 Retention time prediction using kernel methods
Eric Bach (eric.bach@aalto.fi)

Background: In untargeted metabolomics studies, complex biological samples containing possibly thousands of molecules are encountered. Tandem mass spectrometry (MS/MS) is a widely used technique for extracting patterns from biological samples in order to identify the molecules in them. However, the sensitivity of a mass spectrometer depends on the ability to reduce the complexity of the biological sample, e.g. to prevent MS/MS spectra from representing more than one molecule. Liquid chromatography (LC) is a technique for such complexity reduction. When a properly prepared biological sample is passed through an LC column, the molecules in the sample interact differently with the column's stationary phase. As a result, the molecules separate over time according to their molecular properties: some pass through the column faster than others. The time at which a molecule leaves the column is called its retention time.

The retention time can serve as orthogonal information for metabolite identification, e.g. it can exclude molecular candidates that are expected to have a different retention time [Aic+15] or make it possible to distinguish diastereoisomers [SNV15]. Unfortunately, retention time measurements are available only for a small number of molecules and are not comparable between different chromatographic systems. At the same time, the set of molecular candidates for the identification of a single molecule (given its MS/MS spectrum) can contain thousands of molecules. Therefore, machine learning algorithms have been applied to predict retention times from the structure of a molecule [Aic+15; Fal+16].
Goal: In this project the student will implement and apply two different kernelized regression approaches to predict the retention time of molecules from their structure.

Methods and materials: The student will be provided with a data set containing retention time measurements for 596 molecules, together with the corresponding molecular descriptors and fingerprints. The student will implement Kernel Ridge Regression (KRR) and Magnitude-Preserving Kernel Regression (MPKR) [CMR07], and apply both approaches to predict the retention times for the molecular structures in the data set. The student will then compare the performance of KRR and MPKR and investigate whether the magnitude-preserving error term leads to better retention time prediction.

Prerequisite: Basic knowledge of machine learning (especially kernel methods) and parameter estimation (i.e. cross-validation), linear algebra, and programming skills in R, MATLAB or Python. Some basic knowledge of molecular biology and chemoinformatics is beneficial.

[Aic+15] Fabian Aicheler et al. Retention Time Prediction Improves Identification in Nontargeted Lipidomics Approaches. In: Analytical Chemistry (2015).
[CMR07] Corinna Cortes et al. Magnitude-preserving Ranking Algorithms. In: Proceedings of the 24th International Conference on Machine Learning. ICML '07. ACM.
[Fal+16] Federico Falchi et al. Kernel-Based, Partial Least Squares Quantitative Structure-Retention Relationship Model for UPLC Retention Time Prediction: A Useful Tool for Metabolite Identification. In: Analytical Chemistry (2016).
[SNV15] Jan Stanstrup et al. PredRet: Prediction of Retention Time by Direct Mapping between Multiple Chromatographic Systems. In: Analytical Chemistry (2015).
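For Project 1, the two regressors can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's reference implementation: the data are synthetic stand-ins for the 596 molecules, the Gaussian kernel and the values of `gamma` and `LAM` are arbitrary choices, and `mp_fit` solves a pairwise squared-difference objective in the spirit of [CMR07] (all function names are hypothetical).

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(K, y, lam):
    """KRR dual weights: solve (K + lam * I) alpha = y."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def mp_fit(K, y, lam):
    """Dual weights for a pairwise (magnitude-preserving) squared loss.

    Minimises lam * a'Ka + (1/n^2) * sum_ij ((f_i - f_j) - (y_i - y_j))^2
    with f = K a. The pairwise sum equals 2n * d'Hd for d = f - y, where H
    is the centering matrix, which gives the linear system solved below.
    """
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.solve(lam * np.eye(n) + (2.0 / n) * H @ K, (2.0 / n) * H @ y)

def pairwise_error(pred, y):
    """Mean squared error on pairwise differences (invariant to a constant shift)."""
    dp = pred[:, None] - pred[None, :]
    dy = y[:, None] - y[None, :]
    return np.mean((dp - dy) ** 2)

# Synthetic stand-in for descriptors X and retention times y (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=60)
train, test = np.arange(40), np.arange(40, 60)

LAM = 0.1
K = gaussian_kernel(X, X, gamma=0.2)
K_tr = K[np.ix_(train, train)]   # train x train block
K_te = K[np.ix_(test, train)]    # test x train block

alpha_krr = krr_fit(K_tr, y[train], LAM)
alpha_mp = mp_fit(K_tr, y[train], LAM)
pred_krr = K_te @ alpha_krr
pred_mp = K_te @ alpha_mp

print("KRR  pairwise error:", pairwise_error(pred_krr, y[test]))
print("MPKR pairwise error:", pairwise_error(pred_mp, y[test]))
```

The pairwise objective only constrains differences between predictions, so the two models are compared here with a shift-invariant error. In the actual project, `gamma` and `LAM` would be selected by cross-validation rather than fixed.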
2 Metabolite identification from tandem mass spectra
Céline Brouard

Background: Metabolites are small molecules involved in the biological processes of organisms. Metabolite identification, an important problem in molecular biology, consists of identifying the molecular structures of the unknown metabolites present in a biological sample. Information on these unknown metabolites can be obtained using tandem mass spectrometry. Recent progress in metabolite identification has come from machine learning-based methods.

Goal: The goal of this project is to implement the CSI:FingerID method described in the lecture. The method will be applied to the dataset used in the last CASMI (Critical Assessment of Small Molecule Identification) contest. The idea of this contest is to evaluate different metabolite identification methods on a common dataset. A set of training examples is provided, and for each given tandem mass spectrum the correct molecular structure has to be determined from a set of potential molecular candidates.

Materials and Methods: In this project, the student will implement the CSI:FingerID method. During the learning phase, the training MS/MS spectra are used to train a set of Support Vector Machine (SVM) classifiers, one per molecular property. The SVM parameter C will be tuned using k-fold cross-validation on the training set, independently for each molecular property. In the prediction phase, the fingerprints of the unknown metabolites are predicted from their MS/MS spectra. The predicted fingerprints are then compared to the fingerprints of the candidate molecular structures to find the best match. The training dataset contains 234 tandem mass spectra and the challenge dataset consists of 127 tandem mass spectra. A list of candidates is provided for each challenge spectrum. For each molecule, fingerprints have been retrieved from PubChem and OpenBabel. Kernels on tandem mass spectra will be provided as input.
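The final matching step of the prediction phase can be illustrated as follows. This is only a sketch under an assumption: the text does not say how predicted and candidate fingerprints are compared, so the code simply counts agreeing bits, optionally weighted per bit (for example by the cross-validation accuracy of each SVM). The function name `rank_candidates` and the toy fingerprints are hypothetical.

```python
import numpy as np

def rank_candidates(predicted_fp, candidate_fps, bit_weights=None):
    """Rank candidate structures by (weighted) bit agreement with a predicted fingerprint.

    predicted_fp : (n_bits,) array of 0/1 predictions, one per molecular property
    candidate_fps: (n_candidates, n_bits) array of 0/1 candidate fingerprints
    bit_weights  : optional (n_bits,) array, e.g. per-property SVM accuracy
    Returns candidate indices sorted from best to worst match, and the scores.
    """
    predicted_fp = np.asarray(predicted_fp)
    candidate_fps = np.asarray(candidate_fps)
    if bit_weights is None:
        bit_weights = np.ones(predicted_fp.shape[0])
    agreement = (candidate_fps == predicted_fp[None, :]).astype(float)
    scores = agreement @ bit_weights
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Toy example: one predicted fingerprint and three candidates.
predicted = np.array([1, 0, 1, 1, 0, 0])
candidates = np.array([
    [1, 0, 0, 1, 0, 1],   # 4 matching bits
    [1, 0, 1, 1, 0, 0],   # 6 matching bits
    [0, 1, 0, 0, 1, 1],   # 0 matching bits
])
order, scores = rank_candidates(predicted, candidates)
print(order, scores)
```

In the project itself, `predicted` would hold the outputs of the per-property SVMs for one challenge spectrum, and `candidates` the fingerprints of the molecules in that spectrum's candidate list.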
Required background knowledge/skills: Programming skills (preferably MATLAB or R), basic knowledge of machine learning, and an understanding of the basic principles of support vector machines. Some knowledge of molecular biology will be beneficial.

[1] Heinonen, M., Shen, H., Zamboni, N., and Rousu, J. (2012). Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics, 28(18).
[2] Shen, H., Dührkop, K., Böcker, S., and Rousu, J. (2014). Metabolite identification through multiple kernel learning on fragmentation trees. Bioinformatics, 30(12):i157-i164.
[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences, 112(41).

3 Multiple kernel learning for drug-protein interaction prediction
Anna Cichonska (anna.cichonska@aalto.fi)

Background: Drug-like chemical compounds act mainly by modulating cellular targets, such as proteins. Experimental determination of interactions between chemical compounds and protein targets is time consuming and expensive. In recent years, much effort has therefore gone into developing computational methods that can provide fast, large-scale and systematic pre-screening of chemical probes. In particular, a lot of work has been devoted to compound-based interaction prediction methods, including quantitative structure-activity relationship (QSAR) models, which aim to relate structural properties of chemical molecules to their bioactivity profiles. Another class of computational methods, so-called target-based methods, focuses on evaluating similarities between amino acid sequences or three-dimensional structures of protein targets. In these supervised learning approaches, models are trained using available bioactivity data together with either compound or protein information, which then allows predicting either new targets of a given drug or new drugs targeting a given protein. As a more recent class of computational modelling approaches, systems-based frameworks take advantage of the information available on both compounds and proteins.
A key assumption is that similar drug compounds interact with similar proteins. A proper representation and use of similarities, which is equivalent to a kernel choice, is therefore a first critical prerequisite for high-quality drug-protein interaction (DPI) predictions. Classical kernel-based methods rely on a single kernel. However, such approaches are unlikely to be optimal when a growing variety of biological and molecular data sources become available simultaneously. Multiple kernel learning (MKL) methods search for an optimal combination of several kernels, which makes it possible to use different information sources simultaneously and to learn their importance for the prediction task; they are therefore receiving increasing attention. Typically, a binary-valued DPI prediction setup is employed. However, molecular interactions are not simple on-off relationships, and predicting real-valued binding affinities is more appealing.
Goal: The goal of the project is to compute several protein kernels as well as drug kernels, and then use them in an MKL regression framework to predict drug-protein binding affinities.

Materials and Methods: The data set consists of 50 drug compounds and 50 protein targets, a subset of the data from the experimental study of Metz et al. (2011). DPIs are represented as real values reflecting how tightly a compound binds to a protein. The student will calculate Tanimoto kernels for drug compounds based on several fingerprints implemented in the ChemmineR R package. For proteins, Smith-Waterman amino acid sequence alignment as well as the Generic String kernel will be adopted. The student can also choose to compute other molecular descriptors. Then, pairwise kernels that directly relate drug-protein pairs will be constructed by taking the Kronecker product of each pair of drug kernel and protein kernel.

The student will use the pairwise kernels with the two-stage MKL algorithm ALIGNF. In the first stage, kernel mixture weights are determined by maximising the centred alignment, i.e. a matrix similarity measure, between the combined kernel and the ideal, so-called target kernel derived from the label values. In the second stage, the combined kernel is used with Kernel Ridge Regression (KRR) as the prediction model. The student will be provided with a script for calculating the kernel mixture weights (first stage) but should implement KRR (second stage). The UNIMKL algorithm will form a baseline model, in which all kernel mixture weights are equal to 1/P, P being the number of input kernels. The student will implement nested cross-validation to tune the regularisation parameter λ of KRR and assess the predictive performance of the model.

Prerequisite: Programming skills (MATLAB, R, Python), basic knowledge of machine learning. Some knowledge of chemoinformatics will be beneficial.

[1] Ding H, Takigawa I, Mamitsuka H, Zhu S. Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Briefings in Bioinformatics 2014; 15(5).
[2] Cichonska A, Rousu J, Aittokallio T. Identification of drug candidates and repurposing opportunities through compound-target interaction networks. Expert Opinion on Drug Discovery 2015; 10(12).
[3] Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 2008; 24(13).
[4] Giguere S, Marchand M, Laviolette F, Drouin A, Corbeil J. Learning a peptide-protein binding affinity predictor with kernel ridge regression. BMC Bioinformatics 2013; 14(1): 82.
[5] Cortes C, Mohri M, Rostamizadeh A. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 2012; 13(Mar).
[6] Metz JT, Johnson EF, Soni NB et al. Navigating the kinome. Nature Chemical Biology 2011; 7(4).
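For Project 3, the kernel construction can be sketched in NumPy as follows: Tanimoto drug kernels, Kronecker-product pairwise kernels, the centred alignment with the target kernel, and the uniform UNIMKL combination. The sizes (4 drugs, 3 proteins), the random fingerprints and the linear protein kernel are placeholders assumed for illustration; the ALIGNF weight optimisation itself is not reproduced, since the text states that a script for it is provided.

```python
import numpy as np

def tanimoto_kernel(F):
    """Tanimoto kernel between the binary fingerprint rows of F."""
    F = np.asarray(F, dtype=float)
    inter = F @ F.T                                  # shared "on" bits
    counts = F.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    safe = np.where(union > 0, union, 1.0)           # avoid division by zero
    # Convention assumed here: two all-zero fingerprints have similarity 1.
    return np.where(union > 0, inter / safe, 1.0)

def center(K):
    """Centre a kernel matrix in feature space: H K H with H = I - 11'/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Centred alignment <K1c, K2c>_F / (||K1c||_F * ||K2c||_F)."""
    K1c, K2c = center(K1), center(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

rng = np.random.default_rng(0)
n_drugs, n_prots = 4, 3

# Two drug kernels from two hypothetical fingerprint types.
Kd_list = [tanimoto_kernel(rng.integers(0, 2, size=(n_drugs, 16))) for _ in range(2)]
# Placeholder protein kernel (stands in for e.g. a sequence-alignment kernel).
P = rng.normal(size=(n_prots, 5))
Kp = P @ P.T

# Pairwise kernels over all (drug, protein) pairs, drug index varying slowest.
K_pairs = [np.kron(Kd, Kp) for Kd in Kd_list]

# Binding affinities, flattened in the same (drug-major) order as np.kron.
Y = rng.normal(size=(n_drugs, n_prots))
y = Y.ravel()
K_target = np.outer(y, y)                            # ideal "target" kernel

alignments = [centered_alignment(K, K_target) for K in K_pairs]
K_uni = sum(K_pairs) / len(K_pairs)                  # UNIMKL: weights 1/P
print("alignment of each pairwise kernel:", alignments)
print("alignment of the uniform mix:", centered_alignment(K_uni, K_target))
```

The combined kernel `K_uni` (or an ALIGNF-weighted sum) would then be passed to KRR, with λ chosen in the inner loop of the nested cross-validation.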
4 Differential gene expression analysis
Markus Heinonen

Background: In differential gene expression analysis, statistical methods are applied to find which genes are over- or under-expressed with respect to control baseline expression levels. These results are subsequently analysed for biological significance by inspecting the functional annotations of these genes, in order to gain insight into the cellular processes of interest. In static differential testing, expression matrices are compared using well-defined statistics. In dynamic differential testing, time series or interpolation models over time are compared using frequentist or Bayesian statistics. In both cases a large-scale view of the expression patterns of thousands of genes emerges.

The key question in differential analysis is the choice of model for the expression patterns. Genes commonly exhibit non-stationarity, where the underlying dynamics can change abruptly through perturbation or regulation. The sparse and often irregularly sampled data warrant careful modeling of the signals. Typically, the underlying model family for interpolation and data representation is the Gaussian process (GP). Differential expression can be tested against a constant level, between two conditions, or between multiple conditions.

Goal: The goal of the project is to model gene expression time series with Gaussian processes and apply differential testing to find differentially regulated genes between conditions.

Materials and methods: In this project the response of Arabidopsis plant gene expression to Botrytis infection is analysed. The gene expression time series are modeled using Gaussian processes, and two-sample interval testing is carried out to find out which genes are differentially expressed in the infection response, and when. The analysis results in a temporal cascade of differential gene expression.
The plant gene expression measurements are large-scale and of high quality, with numerous biological and technical replicates. The data are available in the GEO database. The dataset consists of 24 time points (2, 4, ..., 48 hours) for infected and normal plant cells, with 4 biological replicates (plants) and 3 technical replicates, for almost 10,000 gene probes. The GP modeling can be performed with any GP implementation (e.g. gpml/gpstuff in MATLAB, gptk/gpfit in R, pygp/gpy in Python). A suitable learning criterion, such as marginal likelihood or cross-validation, should be used. An appropriate kernel prior should be chosen as well, with the Gaussian kernel being a common choice. Two-sample testing should be implemented according to the Bayesian EMLL framework (see slides). Finally, the differentially expressed genes can be studied in many ways. These include visualisation over time, clustering of their expression patterns, or considering their functional classifications (such as GO terms, KEGG pathways, InterPro families or PANTHER functional classification), which are found in several databases, for instance the DAVID and BioGPS web servers.

Optionally, the student can experiment with non-stationary GPs, where the observation noise or signal variance is time dependent. The GPstuff package contains an implementation of non-stationary GPs. The goal is to analyse which gene expression time series warrant a non-stationary GP, and to analyse the model improvement and runtime effects of adding non-stationarity [see Tolvanen et al. 2014]. More detailed instructions will be available from the instructor.

Required background knowledge/skills: Programming skills (MATLAB, R, Python), basic statistics, basic Bayesian statistics and machine learning. Some knowledge of biology will be useful.

Heinonen et al. (2015): Detecting time periods of differential gene expression using Gaussian processes: An application to endothelial cells exposed to radiotherapy dose fraction. Bioinformatics, 31.
Rasmussen & Williams (2006): Gaussian processes for machine learning [sections 2, 4.2 and 5.4].
Windram et al. (2012): Arabidopsis Defense against Botrytis cinerea: Chronology and Regulation Deciphered by High-Resolution Temporal Transcriptomic Analysis. The Plant Cell, 24.
Stegle et al. (2010): A Robust Bayesian Two-Sample Test for Detecting Intervals of Differential Gene Expression in Microarray Time Series. Journal of Computational Biology, 17.
Tolvanen et al. (2014): Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In IEEE MLSP.
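For Project 4, the role of the marginal likelihood in GP-based two-sample testing can be illustrated with a small NumPy sketch. It is not the EMLL framework from the slides: it uses a simpler, assumed score that compares two independent GPs against one shared GP by their log marginal likelihoods. The hyperparameters are fixed by hand, the time series are synthetic, and all function names are hypothetical.

```python
import numpy as np

def rbf(t1, t2, variance=1.0, lengthscale=8.0):
    """Gaussian (squared-exponential) kernel between two vectors of time points."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal_likelihood(t, y, noise=0.01, **kern):
    """log p(y | t) for a zero-mean GP with Gaussian observation noise."""
    K = rbf(t, t, **kern) + noise * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                    # = 0.5 * log|K|
            - 0.5 * len(t) * np.log(2.0 * np.pi))

def two_sample_score(t, y_a, y_b, **kw):
    """Log evidence of two separate GPs minus that of one shared GP.

    Larger values favour the hypothesis that the two series differ.
    """
    separate = log_marginal_likelihood(t, y_a, **kw) + log_marginal_likelihood(t, y_b, **kw)
    shared = log_marginal_likelihood(np.concatenate([t, t]),
                                     np.concatenate([y_a, y_b]), **kw)
    return separate - shared

# Synthetic expression profiles sampled every 2 hours up to 48 hours.
rng = np.random.default_rng(1)
t = np.arange(2.0, 50.0, 2.0)
base = np.sin(t / 8.0)
control = base + 0.1 * rng.normal(size=t.size)
same = base + 0.1 * rng.normal(size=t.size)               # same underlying signal
shifted = (base + 1.5 / (1.0 + np.exp(-(t - 24.0) / 3.0))
           + 0.1 * rng.normal(size=t.size))               # diverges after ~24 h

score_same = two_sample_score(t, control, same)
score_diff = two_sample_score(t, control, shifted)
print("score (no difference):", score_same)
print("score (differential): ", score_diff)
```

In practice the kernel hyperparameters would be learned per gene (for example by maximising the marginal likelihood), and the replicates would enter as repeated observations at each time point.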
5 Learning molecular representation with an autoencoder
Huibin Shen

Background: Current representations of molecules include binary vector representations such as molecular fingerprints, string representations such as InChI or SMILES, and 2D/3D graphs. Many applications related to molecules are based on some kind of representation. Learning a better representation of the data is at the core of deep learning. The number of molecules in today's compound databases is on the scale of millions. With deep learning approaches, it is possible to learn a compact and continuous vector representation from such data.

Goal: In this project, we will use a variational autoencoder to learn such a representation and test it in a metabolite identification pipeline. We will first test the autoencoder on a subset of 5M molecules with a fingerprint representation or a SMILES string representation. The code and data are already available. The student will run the code on the GPU nodes of Triton.

Prerequisite: Python and basic knowledge of machine learning and deep learning.

[1] Gómez-Bombarelli, R., Duvenaud, D., Hernández-Lobato, J. M., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. (2016). Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint.
[2] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint.
[3] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint.
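For Project 5, the objective that a variational autoencoder maximises can be written down as a reminder. The notation is the standard one from [3] and is an assumption here, since the text does not fix any symbols: x is a molecule representation (fingerprint or SMILES string), z the continuous latent vector, q_φ(z|x) the encoder, p_θ(x|z) the decoder, and p(z) = N(0, I) the prior. The second line gives the closed form of the KL term for a diagonal Gaussian encoder with mean μ and variance σ² over J latent dimensions.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)

-\mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
  = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

The first term rewards accurate reconstruction of the molecule from z, while the KL term keeps the latent codes close to the prior, which is what makes the learned representation compact and continuous.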
More informationOutline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren
1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information
More informationCSCE555 Bioinformatics. Protein Function Annotation
CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The
More informationarxiv: v1 [stat.ml] 6 Dec 2018
missiwae: Deep Generative Modelling and Imputation of Incomplete Data arxiv:1812.02633v1 [stat.ml] 6 Dec 2018 Pierre-Alexandre Mattei Department of Computer Science IT University of Copenhagen pima@itu.dk
More informationPrediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines
Article Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Yun-Fei Wang, Huan Chen, and Yan-Hong Zhou* Hubei Bioinformatics and Molecular Imaging Key Laboratory,
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationBST 226 Statistical Methods for Bioinformatics David M. Rocke. January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1
BST 226 Statistical Methods for Bioinformatics David M. Rocke January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1 Mass Spectrometry Mass spectrometry (mass spec, MS) comprises a set of instrumental
More informationMachine Learning! in just a few minutes. Jan Peters Gerhard Neumann
Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often
More informationGaussian Processes: We demand rigorously defined areas of uncertainty and doubt
Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt ACS Spring National Meeting. COMP, March 16 th 2016 Matthew Segall, Peter Hunt, Ed Champness matt.segall@optibrium.com Optibrium,
More informationEfficient Complex Output Prediction
Efficient Complex Output Prediction Florence d Alché-Buc Joint work with Romain Brault, Alex Lambert, Maxime Sangnier October 12, 2017 LTCI, Télécom ParisTech, Institut-Mines Télécom, Université Paris-Saclay
More informationXia Ning,*, Huzefa Rangwala, and George Karypis
J. Chem. Inf. Model. XXXX, xxx, 000 A Multi-Assay-Based Structure-Activity Relationship Models: Improving Structure-Activity Relationship Models by Incorporating Activity Information from Related Targets
More informationMaximum Direction to Geometric Mean Spectral Response Ratios using the Relevance Vector Machine
Maximum Direction to Geometric Mean Spectral Response Ratios using the Relevance Vector Machine Y. Dak Hazirbaba, J. Tezcan, Q. Cheng Southern Illinois University Carbondale, IL, USA SUMMARY: The 2009
More informationLecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides
Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture
More informationComputational Genomics. Systems biology. Putting it together: Data integration using graphical models
02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput
More informationLeast Absolute Shrinkage is Equivalent to Quadratic Penalization
Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr
More informationLearning in Bayesian Networks
Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks
More informationHYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH
HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin
More informationBAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS
BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS Oloyede I. Department of Statistics, University of Ilorin, Ilorin, Nigeria Corresponding Author: Oloyede I.,
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationSUSPECT AND NON-TARGET SCREENING OF ORGANIC MICROPOLLUTANTS IN WASTEWATER THROUGH THE DEVELOPMENT OF A LC-HRMS BASED WORKFLOW
SUSPECT AND NON-TARGET SCREENING OF ORGANIC MICROPOLLUTANTS IN WASTEWATER THROUGH THE DEVELOPMENT OF A LC-HRMS BASED WORKFLOW Pablo Gago-Ferrero Laboratory of Analytical Chemistry Department of Chemistry
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationSUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH
SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH Ashutosh Kumar Singh 1, S S Sahu 2, Ankita Mishra 3 1,2,3 Birla Institute of Technology, Mesra, Ranchi Email: 1 ashutosh.4kumar.4singh@gmail.com,
More informationBMD645. Integration of Omics
BMD645 Integration of Omics Shu-Jen Chen, Chang Gung University Dec. 11, 2009 1 Traditional Biology vs. Systems Biology Traditional biology : Single genes or proteins Systems biology: Simultaneously study
More informationCorrelation Autoencoder Hashing for Supervised Cross-Modal Search
Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia
More informationChemical Data Retrieval and Management
Chemical Data Retrieval and Management ChEMBL, ChEBI, and the Chemistry Development Kit Stephan A. Beisken What is EMBL-EBI? Part of the European Molecular Biology Laboratory International, non-profit
More informationAgilent METLIN Personal Metabolite Database and Library MORE CONFIDENCE IN COMPOUND IDENTIFICATION
Agilent METLIN Personal Metabolite Database and Library MORE CONFIDENCE IN COMPOUND IDENTIFICATION COMPOUND IDENTIFICATION AT YOUR FINGERTIPS Compound identifi cation is a key element in untargeted metabolomics
More informationCS 231A Section 1: Linear Algebra & Probability Review
CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability
More informationReal Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report
Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important
More informationBayesian Deep Learning
Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference
More informationIntroduction to Deep Learning
Introduction to Deep Learning Some slides and images are taken from: David Wolfe Corne Wikipedia Geoffrey A. Hinton https://www.macs.hw.ac.uk/~dwcorne/teaching/introdl.ppt Feedforward networks for function
More informationCS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang
CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations
More information6.036 midterm review. Wednesday, March 18, 15
6.036 midterm review 1 Topics covered supervised learning labels available unsupervised learning no labels available semi-supervised learning some labels available - what algorithms have you learned that
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationMixture models for analysing transcriptome and ChIP-chip data
Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech,
More informationDeep Generative Models. (Unsupervised Learning)
Deep Generative Models (Unsupervised Learning) CEng 783 Deep Learning Fall 2017 Emre Akbaş Reminders Next week: project progress demos in class Describe your problem/goal What you have done so far What
More informationSupervised Machine Learning: Learning SVMs and Deep Learning. Klaus-Robert Müller!!et al.!!
Supervised Machine Learning: Learning SVMs and Deep Learning Klaus-Robert Müller!!et al.!! Today s Tutorial Machine Learning introduction: ingredients for ML Kernel Methods and Deep networks with explaining
More informationComputational Biology: Basics & Interesting Problems
Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information
More informationExpression Data Exploration: Association, Patterns, Factors & Regression Modelling
Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation
More informationInferring Transcriptional Regulatory Networks from Gene Expression Data II
Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationCOMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation
COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted
More information