Probabilistic Methods in Bioinformatics. Pabitra Mitra


Probabilistic Methods in Bioinformatics Pabitra Mitra pabitra@cse.iitkgp.ernet.in

Probability in Bioinformatics. Classification: categorize a new object into a known class (supervised learning / predictive analysis). Regression: supervised prediction of continuous-valued variables. Clustering: extract homogeneous groups in a population (unsupervised learning / exploratory analysis). Sequence analysis. Relation and graph structure analysis.

Probabilistic Algorithms. Classification: Bayes classification, graphical models. Regression: logistic regression. Clustering: Gaussian mixture models. Sequence analysis: Markov models, hidden Markov models, conditional random fields. Relation analysis: Markov processes, graph structure analysis. And many, many more.

A Simple Species Classification Problem. Measure the length of a fish and decide its class: Hilsa or Tuna.

Collect Statistics: a sample population for class Hilsa and a sample population for class Tuna.

Distribution of Fish Length: Class Conditional Distributions. Histograms of length (in feet) for each class: HILSA peaks near 1 ft, TUNA near 4 ft. Example values read off the histograms: P(L = 1.75 ft | HILSA), P(L = 4.25 ft | TUNA).

Decision Rule: IF length L <= B THEN HILSA ELSE TUNA. What should be the value of B (the boundary length)? It is chosen based on the population statistics.

Error of Decision Rule. The class-conditional curves P(L | HILSA) and P(L | TUNA) overlap around the boundary B. Errors = Type 1 + Type 2. Type 1: actually Tuna, classified as Hilsa (area under the Tuna curve to the left of B). Type 2: actually Hilsa, classified as Tuna (area under the Hilsa curve to the right of B).

Optimal Decision Rule. P(L | HILSA), P(L | TUNA). B*: the optimal value of B (optimal decision boundary), giving the minimum possible error. At the optimum, P(L = B* | HILSA) = P(L = B* | TUNA). If Type 1 and Type 2 errors have different costs, the optimal boundary shifts.

Species Identification Problem. Measure lengths of a (sizeable) population of Hilsa and Tuna fishes. Estimate the class-conditional distributions for the Hilsa and Tuna classes respectively. Find the optimal decision boundary B* from the distributions. Apply the decision rule to classify a newly caught (and measured) fish as either Hilsa or Tuna (with minimum error probability).
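A minimal Python/NumPy sketch of this pipeline under assumed population statistics (Hilsa around 1 ft, Tuna around 4 ft, as in the histograms above); the sample sizes, means, and spreads are illustrative only, not taken from the slides.

```python
# Sketch of the species-identification pipeline with made-up length statistics.
import numpy as np

rng = np.random.default_rng(0)
hilsa_lengths = rng.normal(1.0, 0.4, 1000)   # assumed measured Hilsa population
tuna_lengths = rng.normal(4.0, 0.8, 1000)    # assumed measured Tuna population

# Estimate class-conditional Gaussians from the samples
mu_h, sd_h = hilsa_lengths.mean(), hilsa_lengths.std()
mu_t, sd_t = tuna_lengths.mean(), tuna_lengths.std()

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Find B*: the length where the two class-conditional densities cross
grid = np.linspace(0, 6, 10000)
b_star = grid[np.argmin(np.abs(gauss(grid, mu_h, sd_h) - gauss(grid, mu_t, sd_t)))]

def classify(length):
    return "HILSA" if length <= b_star else "TUNA"

print(f"B* = {b_star:.2f} ft, a 2 ft fish -> {classify(2.0)}")
```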

Location/Time of Experiment. Calcutta in monsoon: more Hilsa, few Tuna. California in winter: more Tuna, less Hilsa. Even a 2 ft fish is likely to be Hilsa in Calcutta (2000 Rs/kilo!); a 1.5 ft fish may be Tuna in California.

Apriori Probability. Without measuring length, what can we guess about the class of a fish? It depends on the location/time of the experiment: Calcutta: Hilsa; California: Tuna. Apriori probability: P(HILSA), P(TUNA). A property of the frequency of the classes during the experiment, not a property of the length of the fish. Calcutta: P(Hilsa) = 0.90, P(Tuna) = 0.10. California: P(Tuna) = 0.95, P(Hilsa) = 0.05. London: P(Tuna) = 0.50, P(Hilsa) = 0.50. Also a determining factor in the class decision, along with the class-conditional probability.

Classification Decision. We consider the product of the apriori and class-conditional probability factors: the posteriori probability (Bayes rule). P(HILSA | L = 2 ft) = P(HILSA) x P(L = 2 ft | HILSA) / P(L = 2 ft). Posteriori is proportional to Apriori x Class conditional; the denominator is constant for all classes. Apriori: without any measurement, based on just location/time, what we can guess about class membership (estimated from the sizes of the class populations). Class conditional: given that the fish belongs to a particular class, the probability that its length is L = 2 ft (estimated from the population). Posteriori: given the measurement that the length of the fish is L = 2 ft, the probability that the fish belongs to a particular class (obtained using Bayes rule from the above two probabilities). Useful in decision making using evidence/measurements.
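A small numeric sketch of this rule for one measurement L = 2 ft. The priors are the Calcutta figures quoted above; the class-conditional values are assumed for illustration.

```python
# Bayes rule: posterior = prior * class-conditional / evidence
prior = {"HILSA": 0.90, "TUNA": 0.10}
likelihood_at_2ft = {"HILSA": 0.05, "TUNA": 0.20}   # assumed P(L = 2 ft | class)

evidence = sum(prior[c] * likelihood_at_2ft[c] for c in prior)   # P(L = 2 ft)
posterior = {c: prior[c] * likelihood_at_2ft[c] / evidence for c in prior}

print(posterior)   # HILSA ~0.69, TUNA ~0.31
# Even though a 2 ft fish is more likely under the (assumed) Tuna length
# distribution, the high Hilsa prior in Calcutta pulls the posterior towards Hilsa.
```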

Bayes Classification Rule (Bayes Classifier). Posteriori distributions P(HILSA | L) and P(TUNA | L). B*: the optimal value of B (Bayes decision boundary), where P(HILSA | L = B*) = P(TUNA | L = B*). Minimum error probability: the Bayes error.

MAP Representation of Bayes Classifier. Posteriori distributions P(TUNA | L), P(HILSA | L): Hilsa has a higher posteriori probability than Tuna for this length. Instead of finding the decision boundary B*, state the classification rule as: classify an object into the class for which it has the highest posteriori probability (MAP: Maximum A Posteriori Probability).

MAP Multiclass Classifier. Posteriori distributions P(HILSA | L), P(TUNA | L), P(SHARK | L): Hilsa has the highest posteriori probability among all classes for this length. Classify an object into the class for which it has the highest posteriori probability (MAP: Maximum A Posteriori Probability).

Multivariate Bayes Classifier. Feature or attribute space: HILSA and TUNA samples plotted against Length and Weight, separated by a decision boundary. Class separability.

Decision Boundary: Normal Distribution. Two spherical classes (HILSA, TUNA) having different means but the same variance (diagonal covariance matrix with equal variances), plotted in the Length-Weight plane. Decision boundary: the perpendicular bisector of the segment joining the mean vectors.
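A small sketch, with assumed class means in the (Length, Weight) plane, of why this case reduces to a nearest-mean rule whose boundary is the perpendicular bisector.

```python
# With equal spherical covariance and equal priors, the Bayes rule reduces to
# "assign to the nearer mean", so the boundary is the perpendicular bisector
# of the segment joining the class means. The means below are illustrative.
import numpy as np

mu_hilsa = np.array([1.0, 0.8])   # (length, weight)
mu_tuna = np.array([4.0, 5.0])

def classify(x):
    return "HILSA" if np.linalg.norm(x - mu_hilsa) < np.linalg.norm(x - mu_tuna) else "TUNA"

midpoint = (mu_hilsa + mu_tuna) / 2
print(classify(midpoint + 0.01 * (mu_hilsa - mu_tuna)))   # HILSA side of the bisector
print(classify(midpoint + 0.01 * (mu_tuna - mu_hilsa)))   # TUNA side of the bisector
```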

Distances. Between two vectors: Euclidean, Minkowski, etc. Between a vector and a distribution: Mahalanobis, Bhattacharyya. Which distribution is closer to x? In one dimension the Mahalanobis distance is d_M(x) = |x - mu| / sigma; in general, d_M(x) = sqrt( (x - mu)^T Sigma^(-1) (x - mu) ). Between two distributions: Kullback-Leibler divergence.
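A minimal sketch of the Mahalanobis distance of a point x from a class distribution with mean mu and covariance Sigma; the means and covariances below are assumed for illustration.

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(sigma) @ d))

x = np.array([2.0, 3.0])
mu1, sigma1 = np.array([1.0, 1.0]), np.array([[1.0, 0.0], [0.0, 1.0]])   # tight distribution
mu2, sigma2 = np.array([4.0, 5.0]), np.array([[4.0, 0.0], [0.0, 4.0]])   # spread-out distribution

# The class whose distribution gives the smaller Mahalanobis distance is "closer" to x.
print(mahalanobis(x, mu1, sigma1), mahalanobis(x, mu2, sigma2))
```

With these numbers, x is nearer to mu1 in Euclidean terms but nearer to the second distribution in Mahalanobis terms, because the second distribution's larger spread makes x less surprising under it.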

Decision Boundary: Normal Distribution. Two classes having different means and a common diagonal covariance matrix with unequal variances along the Length and Weight axes (axis-aligned elliptical contours). Boundary: the locus of points with equal Mahalanobis distance from the two class distributions (still a straight line).

Decision Boundary: Normal Distribution. Two elliptical classes having different means and different covariances (general covariance matrices). Class boundary: a quadratic curve (parabolic in this figure).

Bayesian Classifiers. Approach: compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes' theorem: P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An). Choose the value of C that maximizes P(C | A1, A2, ..., An); this is equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C). How do we estimate P(A1, A2, ..., An | C)?

Estimating Multivariate Class Distributions. Sample size requirement: in a small sample it is difficult to find a Hilsa fish whose length is 1.5 ft and weight is 2 kilos, compared to just finding a Hilsa fish whose length is 1.5 ft; that is, P(L=1.5, W=2 | Hilsa) is harder to estimate than P(L=1.5 | Hilsa). Curse of dimensionality. Independence assumption: assume length and weight are independent given the class, so P(L=1.5, W=2 | Hilsa) = P(L=1.5 | Hilsa) x P(W=2 | Hilsa). Joint distribution = product of marginal distributions; marginals are easier to estimate from a small sample.

Naïve Bayes Classifier. Assume independence among attributes Ai when the class is given: P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj). We can estimate P(Ai | Cj) for all Ai and Cj. A new point is classified to Cj if P(Cj) x P(A1 | Cj) x ... x P(An | Cj) is maximal.

Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A (attributes): Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no. Classes: M = mammals, N = non-mammals.

P(A | M) = 6/7 x 6/7 x 2/7 x 2/7 = 0.06
P(A | N) = 1/13 x 10/13 x 3/13 x 4/13 = 0.0042
P(A | M) P(M) = 0.06 x 7/20 = 0.021
P(A | N) P(N) = 0.0042 x 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N) => Mammals
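A minimal Python sketch reproducing this computation; the rows mirror the table above and, matching the slide, no smoothing is applied.

```python
# Naive Bayes by direct counting on the mammal/non-mammal table.
# Columns: (give_birth, can_fly, live_in_water, have_legs, class)
data = [
    ("yes","no","no","yes","M"), ("no","no","no","no","N"),
    ("no","no","yes","no","N"),  ("yes","no","yes","no","M"),
    ("no","no","sometimes","yes","N"), ("no","no","no","yes","N"),
    ("yes","yes","no","yes","M"), ("no","yes","no","yes","N"),
    ("yes","no","no","yes","M"), ("yes","no","yes","no","N"),
    ("no","no","sometimes","yes","N"), ("no","no","sometimes","yes","N"),
    ("yes","no","no","yes","M"), ("no","no","yes","no","N"),
    ("no","no","sometimes","yes","N"), ("no","no","no","yes","N"),
    ("no","no","no","yes","M"), ("no","yes","no","yes","N"),
    ("yes","no","yes","no","M"), ("no","yes","no","yes","N"),
]

def score(record, cls):
    rows = [r for r in data if r[-1] == cls]
    s = len(rows) / len(data)                     # prior P(cls)
    for i, value in enumerate(record):            # product of P(A_i | cls)
        s *= sum(1 for r in rows if r[i] == value) / len(rows)
    return s

test = ("yes", "no", "yes", "no")   # gives birth, cannot fly, lives in water, no legs
scores = {cls: score(test, cls) for cls in ("M", "N")}
print(scores)                        # {'M': ~0.021, 'N': ~0.0027}
print(max(scores, key=scores.get))   # 'M' -> mammal
```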

How to Estimate Probabilities from Data? For continuous attributes: (1) Discretize the range into bins: one ordinal attribute per bin; this violates the independence assumption. (2) Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute. (3) Probability density estimation: assume the attribute follows a normal distribution, use the data to estimate the parameters of the distribution (e.g., mean and standard deviation), and once the distribution is known, use it to estimate the conditional probability P(Ai | c).
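A minimal sketch of option (3): estimate the mean and standard deviation of a continuous attribute within each class, then evaluate the fitted Gaussian density as P(Ai | c). The training lengths below are assumed for illustration.

```python
import math

def fit_gaussian(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

hilsa_lengths = [0.9, 1.1, 1.0, 1.3, 0.8]   # illustrative training lengths (ft)
tuna_lengths = [3.8, 4.2, 4.5, 3.9, 4.1]

mu_h, sd_h = fit_gaussian(hilsa_lengths)
mu_t, sd_t = fit_gaussian(tuna_lengths)
print(gaussian_pdf(2.0, mu_h, sd_h))   # P(L = 2 ft | HILSA) under the fitted Gaussian
print(gaussian_pdf(2.0, mu_t, sd_t))   # P(L = 2 ft | TUNA)
```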

Naïve Bayes Classifier. If one of the conditional probabilities is zero, then the entire expression becomes zero. Probability estimation:

Original:   P(Ai | C) = Nic / Nc
Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)

Nic: number of class-C training records with attribute value Ai; Nc: number of class-C records; c: number of classes; p: prior probability; m: parameter.
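A small sketch of the three estimates above; the example counts (a value never seen in the class) are assumed.

```python
def original(n_ic, n_c):
    return n_ic / n_c

def laplace(n_ic, n_c, c):            # c: number of classes
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, p, m):      # p: prior probability, m: parameter
    return (n_ic + m * p) / (n_c + m)

n_ic, n_c = 0, 7                            # attribute value never seen in this class
print(original(n_ic, n_c))                  # 0.0 -> wipes out the whole product
print(laplace(n_ic, n_c, c=2))              # 1/9, small but non-zero
print(m_estimate(n_ic, n_c, p=0.5, m=4))    # 2/11, pulled towards the prior
```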

Bayes Classifier (Summary). Robust to isolated noise points. Handles missing values by ignoring the instance during probability-estimate calculations. Robust to irrelevant attributes. The independence assumption may not hold for some attributes: the length and weight of a fish are not independent.

Bayesian Belief Network. A directed acyclic probabilistic graphical model that captures dependence among the attributes (here with nodes for Length, Weight, Color, and Species). Nodes: variables (attributes/class). Directed edges: causality. Absence of an edge: independence. Network structure: from domain knowledge. Joint probabilities: from data.
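A minimal sketch of how a belief network factorizes the joint probability. The structure assumed here (Species as the parent of Length and Weight) and all conditional-probability numbers are illustrative, not taken from the slide's figure.

```python
# Joint probability = product of each node's probability given its parents.
prior = {"HILSA": 0.9, "TUNA": 0.1}                            # P(Species)
p_length = {("short", "HILSA"): 0.8, ("short", "TUNA"): 0.1,
            ("long", "HILSA"): 0.2,  ("long", "TUNA"): 0.9}    # P(Length | Species)
p_weight = {("light", "HILSA"): 0.7, ("light", "TUNA"): 0.2,
            ("heavy", "HILSA"): 0.3, ("heavy", "TUNA"): 0.8}   # P(Weight | Species)

def joint(species, length, weight):
    # P(Species, Length, Weight) = P(Species) P(Length | Species) P(Weight | Species)
    return prior[species] * p_length[(length, species)] * p_weight[(weight, species)]

print(joint("HILSA", "short", "light"))   # 0.9 * 0.8 * 0.7 = 0.504
```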

Nonparametric Statistics. Do not assume a parametric data distribution/model; make decisions based on the given sample. (Compare: Bayesian statistics vs frequentist statistics.)

Nearest Neighbor Classifiers. Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck. To classify a test record: compute its distance to the training records and choose the k nearest records.

Nearest-Neighbor Classifiers. Requires three things: the set of stored records; a distance metric to compute the distance between records; and the value of k, the number of nearest neighbors to retrieve. To classify an unknown record: compute its distance to the training records, identify the k nearest neighbors, and use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).

Nearest Neighbor Classification. Compute the distance between two points, e.g. the Euclidean distance d(p, q) = sqrt( sum_i (p_i - q_i)^2 ). Determine the class from the nearest-neighbor list: take the majority vote of class labels among the k nearest neighbors, optionally weighing each vote according to distance with weight factor w = 1/d^2.
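A minimal sketch of a k-nearest-neighbour classifier with Euclidean distance and optional distance-weighted voting; the tiny (length, weight) training set is assumed for illustration.

```python
import math
from collections import defaultdict

train = [((1.0, 0.8), "HILSA"), ((1.2, 1.0), "HILSA"), ((0.9, 0.7), "HILSA"),
         ((4.0, 5.0), "TUNA"),  ((3.8, 4.5), "TUNA"),  ((4.3, 5.5), "TUNA")]

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(x, k=3, weighted=False):
    neighbours = sorted(train, key=lambda rec: euclidean(x, rec[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        d = euclidean(x, point)
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0   # w = 1/d^2
    return max(votes, key=votes.get)

print(knn_classify((1.1, 0.9), k=3))                 # HILSA by majority vote
print(knn_classify((2.5, 3.0), k=3, weighted=True))  # distance-weighted vote
```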

Nearest Neighbor Classification. Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes.

Nearest Neighbor Classification. k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems. Classifying unknown records is relatively expensive.

DNA Coding Segment Identification. Classes: coding / non-coding segment. Attributes/features: sequence information. Complex interdependence among attributes.

Microarray Data Analysis. Classes: disease classes. Attributes/features: gene expression levels. Large number of attributes, fewer samples.

Protein Secondary Structure Prediction. Classes: alpha-helix, coil, etc. Attributes/features: length, amino acid sequence, hydrophobicity, shape, ions. Complex class distributions.

Protein Interaction Prediction. Classes: binary. Attributes/features: protein properties. Presence of domain knowledge.

References Pattern Classification, Duda, Hart and Stork, Wiley, 2010 Slides on data mining by Vipin Kumar

Questions!