Data Mining in Bioinformatics

1 Data Mining in Bioinformatics
Arlindo Oliveira

Data Mining: Concepts and Techniques
  Data mining concepts
  Learning from examples
  Decision trees
  Neural networks
  Clustering

2 Typical Supervised Learning Problem Setting
  Given a set (database) of observations
  Each observation is a tuple (x1, ..., xn, y)
    x1, ..., xn are the input variables
    y is a particular output
  Build a model to predict y = f(x1, ..., xn)
    First define a criterion to measure model quality
    Split the dataset into training and test sets
    Build the model using the training set
    Validate the model using the test set

A Database (Example)
  [Table: columns X1, X2, X3, X4, X5, X6, Y and f(x1, ..., x6). The input values did not survive extraction. Y and the prediction f take the values GOOD or BAD; they agree on most rows and disagree on a few.]

3 Main Steps
  Select a subset of relevant input variables
  Build a model using these variables
    Generate a sequence of models
    Identify one (or several) as being good models
    Use only the training set
  Validate the selected models
    Quantitatively: using the test set
    Qualitatively: using expert knowledge

Main Classes of Methods
  Supervised learning (= input/output models)
    Decision/regression trees
    Neural networks
  Unsupervised learning (= p(x1, ..., xn) models)
    Bayesian networks
    Clustering

4 Inductive Learning
  Learning from examples
  The general problem of inductive inference
  Inductive bias
  Examples

Training Examples for Concept EnjoySport
  Concept: days on which my friend Aldo enjoys his favourite water sports
  Task: predict the value of EnjoySport for an arbitrary day, based on the values of the other attributes
  Each row below is one instance:

  Sky    Temp  Humid   Wind    Water  Forecast  EnjoySport
  Sunny  Warm  Normal  Strong  Warm   Same      [lost]
  Sunny  Warm  High    Strong  Warm   Same      [lost]
  Rainy  Cold  High    Strong  Warm   Change    [lost]
  Sunny  Warm  High    Strong  Cool   Change    [lost]

  [The EnjoySport column values did not survive extraction.]

5 Inductive Learning Hypothesis
  Any hypothesis found to approximate the target function well over the training examples will also approximate the target function well over the unobserved examples.

Futility of Bias-Free Learning
  A learner that makes no prior assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
  No Free Lunch!

6 Decision Trees
  Decision tree representation
  ID3 learning algorithm
  Entropy, information gain
  Overfitting

What Is a Decision Tree?
  [Figure: example tree. The root tests the value of X1 (Small / Medium or Large); an internal node tests the value of X2 (< 0.34 / > 0.34); the leaves are "Y is big", "Y is very big" and "Y is small".]

7 Training Examples

  Day  Outlook   Temp.  Humidity  Wind    PlayTennis
  D1   Sunny     Hot    High      Weak    No
  D2   Sunny     Hot    High      Strong  No
  D3   Overcast  Hot    High      Weak    Yes
  D4   Rain      Mild   High      Weak    Yes
  D5   Rain      Cool   Normal    Weak    Yes
  D6   Rain      Cool   Normal    Strong  No
  D7   Overcast  Cool   Normal    Strong  Yes
  D8   Sunny     Mild   High      Weak    No
  D9   Sunny     Cool   Normal    Weak    Yes
  D10  Rain      Mild   Normal    Weak    Yes
  D11  Sunny     Mild   Normal    Strong  Yes
  D12  Overcast  Mild   High      Strong  Yes
  D13  Overcast  Hot    Normal    Weak    Yes
  D14  Rain      Mild   High      Strong  No

Decision Tree for PlayTennis

  Outlook
    Sunny: test Humidity
      High: No
      Normal: Yes
    Overcast: Yes
    Rain: test Wind
      Strong: No
      Weak: Yes

8 Decision Tree for PlayTennis
  Each internal node tests an attribute
  Each branch corresponds to an attribute value
  Each leaf node assigns a classification
  [Figure: the PlayTennis tree. Outlook at the root; Sunny leads to a Humidity test (High / Normal); Overcast leads to a leaf; Rain leads to a Wind test (Strong / Weak).]

Decision Tree for PlayTennis
  Classifying a new instance:

  Outlook  Temperature  Humidity  Wind  PlayTennis
  Sunny    Hot          High      Weak  ?

  [Figure: the same PlayTennis tree, used to classify the instance.]

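9 Decision Tree for Conjunction
  Outlook=Sunny ∧ Wind=Weak
  [Figure: Outlook at the root (Sunny / Overcast / Rain); only the Sunny branch tests Wind (Strong / Weak).]

Decision Tree for Disjunction
  Outlook=Sunny ∨ Wind=Weak
  [Figure: Outlook at the root (Sunny / Overcast / Rain); the Overcast and Rain branches each test Wind (Strong / Weak).]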

10 Decision Tree for XOR
  Outlook=Sunny XOR Wind=Weak
  [Figure: Outlook at the root (Sunny / Overcast / Rain); all three branches test Wind (Strong / Weak).]

Decision Tree
  Decision trees represent disjunctions of conjunctions
  [Figure: the PlayTennis tree (Outlook; Humidity under Sunny; Wind under Rain).]

  (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

11 When to Consider Decision Trees
  Instances describable by attribute-value pairs
  Target function is discrete valued
  Disjunctive hypothesis may be required
  Possibly noisy training data
  Missing attribute values
  Examples:
    Medical diagnosis
    Credit risk analysis
    Object classification for robot manipulator (Tan 1993)

Growing and Pruning Pictorially
  [Figure: data mis-fit plotted against tree complexity, with underfitting and overfitting regions; growing moves towards higher complexity, pruning moves back to the final tree.]

12 An Application in Bioinformatics
  Genetics of complex traits
  Database (see)
    Composed of observations on 1086 animals
    Inputs: 20x2 genetic markers
    Outputs: phenotypic measurements (numbers)
  Objective: identify the location of the chromosomal regions involved
  Results: unpruned and pruned trees

Another Application in Bioinformatics
  Identification of protein origin
  Database (see)
    Composed of the frequencies of amino acids in different families
    Inputs: 20 frequencies
    Outputs: class of protein
  Objective: identify the family of the protein

13 Yet Another Application in Bioinformatics
  Identification of regulatory mechanisms between yeast genes
  Data from microarray experiments [Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9]
  Want to predict which genes activate: CLN1, CLN2, CLN3, SWI4

Decision tree for CLN1 activation
  [Figure: decision tree. Internal nodes test the expression levels of YPL256C (CLN2) at threshold -0.375, YPL120C (CLB5) at threshold -0.285 and YDR328C (SKP1) at threshold 0.695; the leaves are "CLN1 active" or "CLN1 not active".]
  [Confusion matrix: most entries lost in extraction; surviving values: 0%, 22%, 100%.]

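14 Decision tree for CLN2 activation
  [Figure: decision tree. Internal nodes test CLB5 at threshold 0, CLN3 at threshold -0.455 and CDH1 at threshold -0.475; the leaves are "CLN2 active" or "CLN2 not active".]
  [Confusion matrix: partially lost in extraction; surviving values: 20%, 33.3%, 80%.]

Decision tree for CLN3 activation
  [Figure: decision tree. Internal nodes test YGL003 (CDH1), SKP1 and CDC53 (threshold 0.025; the other thresholds are lost); the leaves are "CLN3 active" or "CLN3 not active".]
  [Confusion matrix: partially lost in extraction; surviving values: 16.6%, 14.2%, 85.7%.]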

15 Decision tree for SWI4 activation
  [Figure: decision tree. Internal nodes test MBP1, MCM1, CLB1, SIC1 and CLN2; surviving thresholds: -0.28, 1.24 and 0.025; the leaves are "SWI4 active" or "SWI4 not active".]
  [Confusion matrix: partially lost in extraction; surviving values: 9%, 37%, 90%.]

Top-Down Induction of Decision Trees (ID3)
  1. A ← the best decision attribute for the next node
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant
  4. Sort the training examples to the leaf nodes according to the attribute value of the branch
  5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes

16 Which Attribute is Best?
  [Figure: two candidate splits of the same sample S = [29+,35-].
    A1: True gives [21+,5-], False gives [8+,30-]
    A2: True gives [18+,33-], False gives [11+,2-]]

Entropy
  S is a sample of training examples
  $p_+$ is the proportion of positive examples
  $p_-$ is the proportion of negative examples
  Entropy measures the impurity of S

  $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

17 Entropy
  Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)
  Why? In information theory, an optimal-length code assigns $-\log_2 p$ bits to a message having probability p.
  So the expected number of bits to encode the class (+ or -) of a random member of S is

  $-p_+ \log_2 p_+ - p_- \log_2 p_-$

Information Gain
  Gain(S,A): expected reduction in entropy due to sorting S on attribute A

  $Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

  $Entropy([29+,35-]) = -\frac{29}{64} \log_2 \frac{29}{64} - \frac{35}{64} \log_2 \frac{35}{64} = 0.99$

  [Figure: the two candidate splits of S = [29+,35-]. A1: True gives [21+,5-], False gives [8+,30-]. A2: True gives [18+,33-], False gives [11+,2-].]
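The two definitions above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `entropy` and `gain` are illustrative, and a split is described only by its (positive, negative) counts.

```python
from math import log2

def entropy(pos, neg):
    """Entropy in bits of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # the term 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * log2(p)
    return result

def gain(parent, subsets):
    """Information gain of a split; `parent` and each subset are (pos, neg) counts."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in subsets)
    return entropy(*parent) - remainder

S = (29, 35)
gain_a1 = gain(S, [(21, 5), (8, 30)])    # split on A1
gain_a2 = gain(S, [(18, 33), (11, 2)])   # split on A2
print(round(entropy(*S), 2), round(gain_a1, 2), round(gain_a2, 2))
```

On the [29+,35-] sample this gives an entropy of about 0.99, and A1 yields the larger gain of the two candidate attributes.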

18 Information Gain
  Entropy([21+,5-]) = 0.71
  Entropy([8+,30-]) = 0.74
  Gain(S,A1) = Entropy(S) - (26/64) Entropy([21+,5-]) - (38/64) Entropy([8+,30-]) = 0.27

  Entropy([18+,33-]) = 0.94
  Entropy([11+,2-]) = 0.62
  Gain(S,A2) = Entropy(S) - (51/64) Entropy([18+,33-]) - (13/64) Entropy([11+,2-]) = 0.12

  [Figure: the two candidate splits of S = [29+,35-]. A1: True gives [21+,5-], False gives [8+,30-]. A2: True gives [18+,33-], False gives [11+,2-].]

Training Examples

  Day  Outlook   Temp.  Humidity  Wind    PlayTennis
  D1   Sunny     Hot    High      Weak    No
  D2   Sunny     Hot    High      Strong  No
  D3   Overcast  Hot    High      Weak    Yes
  D4   Rain      Mild   High      Weak    Yes
  D5   Rain      Cool   Normal    Weak    Yes
  D6   Rain      Cool   Normal    Strong  No
  D7   Overcast  Cool   Normal    Strong  Yes
  D8   Sunny     Mild   High      Weak    No
  D9   Sunny     Cool   Normal    Weak    Yes
  D10  Rain      Mild   Normal    Weak    Yes
  D11  Sunny     Mild   Normal    Strong  Yes
  D12  Overcast  Mild   High      Strong  Yes
  D13  Overcast  Hot    Normal    Weak    Yes
  D14  Rain      Mild   High      Strong  No

19 Selecting the Next Attribute
  S = [9+,5-], E = 0.940

  Humidity
    High: [3+,4-], E = 0.985
    Normal: [6+,1-], E = 0.592
    Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

  Wind
    Weak: [6+,2-], E = 0.811
    Strong: [3+,3-], E = 1.0
    Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Selecting the Next Attribute
  S = [9+,5-], E = 0.940

  Outlook
    Sunny: [2+,3-], E = 0.971
    Overcast: [4+,0-], E = 0.0
    Rain: [3+,2-], E = 0.971
    Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.246

20 ID3 Algorithm
  Root: [D1,D2,...,D14], [9+,5-], split on Outlook
    Sunny: S_sunny = [D1,D2,D8,D9,D11], [2+,3-], still to be split (?)
    Overcast: [D3,D7,D12,D13], [4+,0-]
    Rain: [D4,D5,D6,D10,D14], [3+,2-], still to be split (?)

  Gain(S_sunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
  Gain(S_sunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
  Gain(S_sunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019

ID3 Algorithm
  Final tree:
  Outlook
    Sunny: test Humidity
      High: [D1,D2]
      Normal: [D8,D9,D11]
    Overcast: [D3,D7,D12,D13]
    Rain: test Wind
      Strong: [D6,D14]
      Weak: [D4,D5,D10]
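The recursion traced above can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the slides: the names `DATA`, `ATTRS` and `id3` are hypothetical, and the attribute values and PlayTennis labels are assumed to follow the standard textbook version of this dataset (chosen so that the subsets shown in the tree come out pure).

```python
from collections import Counter
from math import log2

ATTRS = ("Outlook", "Temp", "Humidity", "Wind")
# Rows D1..D14 as (Outlook, Temp, Humidity, Wind, PlayTennis); assumed textbook values.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(rows):
    """Entropy of the class label (last column) over a list of rows."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    """Information gain of splitting `rows` on the attribute in column i."""
    remainder = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attrs):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:          # perfectly classified: stop
        return labels.pop()
    if not attrs:                 # no attribute left: majority vote
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda i: gain(rows, i))
    rest = [i for i in attrs if i != best]
    return {ATTRS[best]: {v: id3([r for r in rows if r[best] == v], rest)
                          for v in sorted({r[best] for r in rows})}}

tree = id3(DATA, list(range(len(ATTRS))))
print(tree)
```

Under these assumptions the sketch selects Outlook at the root, Humidity under Sunny and Wind under Rain, which matches the tree on the slide.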

21 Hypothesis Space Search ID3
  [Figure: ID3 searching the space of decision trees, growing partial trees by adding attribute tests (A1, A2, A3, A4).]

Hypothesis Space Search ID3
  Hypothesis space is complete!
    Target function surely in there
  Outputs a single hypothesis
  No backtracking on selected attributes (greedy search)
    Local minima (suboptimal splits)
  Statistically-based search choices
    Robust to noisy data
  Inductive bias (search bias)
    Prefer shorter trees over longer ones
    Place high information gain attributes close to the root

22 Inductive Bias in ID3
  H is the power set of instances X
    Unbiased?
  Preference for short trees, and for those with high information gain attributes near the root
  Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
  Occam's razor: prefer the shortest (simplest) hypothesis that fits the data

Occam's Razor
  Why prefer short hypotheses?
  Argument in favor:
    There are fewer short hypotheses than long hypotheses
    A short hypothesis that fits the data is unlikely to be a coincidence
    A long hypothesis that fits the data might be a coincidence
  Argument opposed:
    There are many ways to define small sets of hypotheses
    E.g. all trees with a prime number of nodes that use attributes beginning with "Z"
    What is so special about small sets based on the size of the hypothesis?

23 Overfitting in Decision Tree Learning
  [Figure not preserved in extraction.]

Avoid Overfitting
  How can we avoid overfitting?
    Stop growing when the data split is not statistically significant
    Grow the full tree, then post-prune
  Minimum description length (MDL):
    Minimize: size(tree) + size(misclassifications(tree))

24 Reduced-Error Pruning
  Split the data into a training set and a validation set
  Do until further pruning is harmful:
    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
    2. Greedily remove the one that most improves the validation set accuracy
  Produces the smallest version of the most accurate subtree

Effect of Reduced Error Pruning
  [Figure not preserved in extraction.]

25 Rule Post-Pruning
  1. Convert the tree to an equivalent set of rules
  2. Prune each rule independently of the others
  3. Sort the final rules into the desired sequence for use
  Method used in C4.5

Converting a Tree to Rules
  [Figure: the PlayTennis tree (Outlook; Humidity under Sunny; Wind under Rain).]

  R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
  R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
  R3: If (Outlook=Overcast) Then PlayTennis=Yes
  R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
  R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes

26 Continuous Valued Attributes
  Create a discrete attribute to test a continuous one
    Temperature = [value] °C
    (Temperature > [threshold] °C) = {true, false}
  Where to set the threshold?

  Temperature  15 °C  18 °C  19 °C  22 °C  24 °C  27 °C
  PlayTennis   [values lost in extraction]

  (see the paper by [Fayyad, Irani 1993])

Attributes with Many Values
  Problem: if an attribute has many values, maximizing InformationGain will select it.
  E.g.: imagine using Date as an attribute; it perfectly splits the data into subsets of size 1
  Use GainRatio instead of information gain as the criterion:

  $GainRatio(S,A) = \frac{Gain(S,A)}{SplitInformation(S,A)}$

  $SplitInformation(S,A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

  where $S_i$ is the subset for which attribute A has the value $v_i$
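A small numeric sketch shows how SplitInformation penalises a many-valued attribute. It is illustrative only: the helper names are hypothetical, and the gain figures plugged in (0.940 for a Date-like attribute that isolates each of 14 examples, roughly 0.246 for Outlook with subsets of size 5, 4 and 5) are worked out from the PlayTennis entropies rather than quoted from the slides.

```python
from math import log2

def split_information(sizes):
    """Entropy of the partition itself, given the subset sizes |S_i|."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """Gain divided by SplitInformation (0 if the attribute does not split S)."""
    si = split_information(sizes)
    return gain / si if si > 0 else 0.0

date_ratio = gain_ratio(0.940, [1] * 14)      # 14 singleton subsets
outlook_ratio = gain_ratio(0.246, [5, 4, 5])  # Sunny, Overcast, Rain
print(round(date_ratio, 3), round(outlook_ratio, 3))
```

The raw gain of the Date-like attribute is almost four times that of Outlook, but after division by SplitInformation the advantage shrinks considerably. In this small example the penalty narrows the gap without reversing the ranking.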

27 Attributes with Cost
  Consider:
    Medical diagnosis: a blood test costs 1000 SEK
    Robotics: width_from_one_feet has a cost of 23 secs.
  How to learn a consistent tree with low expected cost?
  Replace Gain by:

  $\frac{Gain^2(S,A)}{Cost(A)}$  [Tan, Schlimmer 1990]

  $\frac{2^{Gain(S,A)} - 1}{(Cost(A)+1)^w}, \quad w \in [0,1]$  [Nunez 1988]

Unknown Attribute Values
  What if some examples are missing values of A?
  Use the training example anyway and sort it through the tree:
    If node n tests A, assign the most common value of A among the other examples sorted to node n
    Or assign the most common value of A among the other examples with the same target value
    Or assign a probability $p_i$ to each possible value $v_i$ of A, and assign a fraction $p_i$ of the example to each descendant in the tree
  Classify new examples in the same fashion

28 Cross-Validation
  Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
  Predict the accuracy of a hypothesis over future unseen instances
  Select the optimal hypothesis from a given set of alternative hypotheses
    Pruning decision trees
    Model selection
    Feature selection
    Combining multiple classifiers (boosting)

Holdout Method
  Partition the data set $D = \{(v_1,y_1), \dots, (v_n,y_n)\}$ into a training set $D_t$ and a validation set $D_h = D \setminus D_t$

  $acc_h = \frac{1}{h} \sum_{(v_i,y_i) \in D_h} \delta(I(D_t, v_i), y_i)$

  $I(D_t, v_i)$: output of the hypothesis induced by learner I trained on data $D_t$, for instance $v_i$
  $\delta(i,j) = 1$ if $i = j$ and 0 otherwise
  h is the number of instances in $D_h$
  Problems:
    Makes insufficient use of the data
    Training and validation set are correlated

29 Cross-Validation
  k-fold cross-validation splits the data set D into k mutually exclusive subsets $D_1, D_2, \dots, D_k$
  Train and test the learning algorithm k times; each time it is trained on $D \setminus D_i$ and tested on $D_i$
  [Figure: four folds $D_1$ to $D_4$, with a different fold held out for testing in each of the four rounds.]

  $acc_{cv} = \frac{1}{n} \sum_{(v_i,y_i) \in D} \delta(I(D \setminus D_{(i)}, v_i), y_i)$

  where $D_{(i)}$ is the fold that contains instance $(v_i, y_i)$

Cross-Validation
  Uses all the data for training and testing
  Complete k-fold cross-validation splits the dataset of size m in all $\binom{m}{m/k}$ possible ways (choosing m/k instances out of m)
  Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training (leave-one-out is equivalent to n-fold cross-validation, with n the size of the data set)
  In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set
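The k-fold accuracy estimate can be sketched as follows. This is an illustrative outline, not the slides' code: `k_fold_indices`, `majority_learner` and the toy data set are hypothetical, and the folds are formed by simple striding (real data would normally be shuffled or stratified first).

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k mutually exclusive folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def majority_learner(train):
    """Toy inducer I: always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda v: label

def cross_val_accuracy(data, k, learner):
    """acc_cv: fraction of instances classified correctly when held out."""
    correct = 0
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        model = learner(train)              # trained on D \ D_i
        correct += sum(model(data[i][0]) == data[i][1] for i in fold)
    return correct / len(data)

# Hypothetical data: 20 instances, every fourth one labelled BAD.
data = [((i,), "GOOD" if i % 4 else "BAD") for i in range(20)]
acc = cross_val_accuracy(data, k=5, learner=majority_learner)
print(acc)
```

Every instance is used for testing exactly once and for training k-1 times, which is the sense in which cross-validation uses all the data for both purposes.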

30 Neural Networks
  Perceptrons
  Gradient descent
  Multi-layer networks
  Backpropagation

Biological Neural Systems
  Neuron switching time: > $10^{-3}$ secs
  Number of neurons in the human brain: ~$10^{10}$
  Connections (synapses) per neuron: ~ [value lost in extraction]
  Face recognition: 0.1 secs
  High degree of parallel computation
  Distributed representations

31 Properties of Artificial Neural Nets (ANNs)
  Many simple neuron-like threshold switching units
  Many weighted interconnections among units
  Highly parallel, distributed processing
  Learning by tuning the connection weights

Appropriate Problem Domains for Neural Network Learning
  Input is high-dimensional, discrete or real-valued (e.g. raw sensor input)
  Output is discrete or real-valued
  Output is a vector of values
  Form of the target function is unknown
  Humans do not need to interpret the results (black box model)

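32 Perceptron
  Linear threshold unit (LTU)
  [Figure: inputs $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feed a summation unit that computes $\sum_{i=0}^{n} w_i x_i$ and produces the output o.]

  $o(x_1, \dots, x_n) = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, and $-1$ otherwise

Decision Surface of a Perceptron
  [Figure: two plots in the $(x_1, x_2)$ plane. In the first, a straight line separates the + points from the - points. In the second (XOR), no such line exists.]
  The perceptron is able to represent some useful functions
    And($x_1$, $x_2$): choose weights $w_0 = -1.5$, $w_1 = 1$, $w_2 = 1$
  But functions that are not linearly separable (e.g. Xor) are not representable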

33 Perceptron Learning Rule

  $w_i \leftarrow w_i + \Delta w_i$
  $\Delta w_i = \eta (t - o) x_i$

  t = c(x) is the target value
  o is the perceptron output
  η is a small constant (e.g. 0.1) called the learning rate
  If the output is correct (t = o), the weights $w_i$ are not changed
  If the output is incorrect (t ≠ o), the weights $w_i$ are changed such that the output of the perceptron for the new weights is closer to t
  The algorithm converges to the correct classification if the training data is linearly separable and η is sufficiently small

Perceptron Learning Rule
  [Figure: worked example in the $(x_1, x_2)$ plane showing the decision line moving after each update, for the training pairs (x,t) = ([-1,-1], 1), ([2,1], -1) and ([1,1], 1). The weight vectors and the arguments of sgn did not survive extraction.]
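The update rule above can be sketched on the And function. This is a minimal illustration with hypothetical names; inputs are assumed to be coded as 0/1 and targets as -1/+1, and the learning rate is set to 0.5 purely so that every update is an exact whole number.

```python
def predict(w, x):
    """Threshold unit. `x` excludes the constant x0 = 1; w[0] is its weight."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def train_perceptron(examples, eta=0.5, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch has no errors."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:
                errors += 1
                w[0] += eta * (t - o)              # x0 = 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if errors == 0:
            break
    return w

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(AND)
print(w, [predict(w, x) for x, _ in AND])
```

The learned weights need not equal the hand-picked ones from the previous slide (-1.5, 1, 1); any weight vector that puts the separating line between (1,1) and the other three points is a valid solution.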

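34 Gradient Descent Learning Rule
  Consider a linear unit without threshold and with continuous output o (not just -1, 1)

  $o = w_0 + w_1 x_1 + \dots + w_n x_n$

  Train the $w_i$'s such that they minimize the squared error

  $E[w_0, \dots, w_n] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

  where D is the set of training examples

Gradient Descent
  D = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}
  [Figure: error surface over $(w_1, w_2)$, with a step from $(w_1, w_2)$ to $(w_1 + \Delta w_1, w_2 + \Delta w_2)$.]

  Gradient: $\nabla E[w] = \left[ \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_n} \right]$

  $\Delta w = -\eta \nabla E[w]$, that is, $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

  $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d \Big( t_d - \sum_j w_j x_{jd} \Big)^2 = \sum_d (t_d - o_d)(-x_{id})$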

35 Gradient Descent
  Gradient-Descent(training_examples, η)
  Each training example is a pair of the form $<(x_1, \dots, x_n), t>$, where $(x_1, \dots, x_n)$ is the vector of input values and t is the target output value; η is the learning rate (e.g. 0.1)
  Initialize each $w_i$ to some small random value
  Until the termination condition is met, Do
    Initialize each $\Delta w_i$ to zero
    For each $<(x_1, \dots, x_n), t>$ in training_examples, Do
      Input the instance $(x_1, \dots, x_n)$ to the linear unit and compute the output o
      For each linear unit weight $w_i$, Do
        $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
    For each linear unit weight $w_i$, Do
      $w_i \leftarrow w_i + \Delta w_i$

Incremental (Stochastic) Gradient Descent
  Batch mode: gradient descent over the entire data D
    $w \leftarrow w - \eta \nabla E_D[w]$
    $E_D[w] = \frac{1}{2} \sum_d (t_d - o_d)^2$
  Incremental mode: gradient descent over individual training examples d
    $w \leftarrow w - \eta \nabla E_d[w]$
    $E_d[w] = \frac{1}{2} (t_d - o_d)^2$
  Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough
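The two modes can be compared on a small noise-free problem. This is a sketch under stated assumptions, not the slides' code: the function names, the learning rate of 0.05, the 500 epochs, the zero initial weights and the data (targets generated as t = 1 + 2x) are all hypothetical choices made for illustration.

```python
def output(w, x):
    """Linear unit: o = w0 + w1*x1 + ... + wn*xn (no threshold)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def batch_gd(examples, eta=0.05, epochs=500):
    """Accumulate delta_w over all examples, then update the weights once per epoch."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in examples:
            err = t - output(w, x)
            delta[0] += eta * err                  # x0 = 1
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * err * xi
        w = [wi + di for wi, di in zip(w, delta)]
    return w

def incremental_gd(examples, eta=0.05, epochs=500):
    """Update the weights immediately after each individual example."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in examples:
            err = t - output(w, x)
            w[0] += eta * err
            for i, xi in enumerate(x, start=1):
                w[i] += eta * err * xi
    return w

examples = [((x,), 1.0 + 2.0 * x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w_batch = batch_gd(examples)
w_incr = incremental_gd(examples)
print(w_batch, w_incr)
```

On this data both variants approach the generating weights (1, 2). With noisy targets or a larger learning rate the incremental version would wander around the minimum rather than settle on it exactly.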

36 Comparison of the Perceptron and Gradient Descent Rules
  The perceptron learning rule is guaranteed to succeed if
    The training examples are linearly separable
    The learning rate η is sufficiently small
  The linear unit training rule uses gradient descent
    Guaranteed to converge to the hypothesis with minimum squared error
    Given a sufficiently small learning rate η
    Even when the training data contains noise
    Even when the training data is not separable by H

Multi-Layer Networks
  [Figure: a feed-forward network with an input layer, a hidden layer and an output layer.]

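37 Sigmoid Unit
  [Figure: inputs $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, feed a summation unit followed by a sigmoid that produces the output o.]

  $net = \sum_{i=0}^{n} w_i x_i$
  $o = \sigma(net) = \frac{1}{1 + e^{-net}}$

  σ(x) is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

  $\frac{d\sigma(x)}{dx} = \sigma(x) (1 - \sigma(x))$

  Derive gradient descent rules to train:
    One sigmoid unit: $\frac{\partial E}{\partial w_i} = -\sum_d (t_d - o_d) \, o_d (1 - o_d) \, x_{id}$
    Multilayer networks of sigmoid units: backpropagation

Backpropagation Algorithm
  Initialize each $w_{i,j}$ to some small random value
  Until the termination condition is met, Do
    For each training example $<(x_1, \dots, x_n), t>$, Do
      Input the instance $(x_1, \dots, x_n)$ to the network and compute the network outputs $o_k$
      For each output unit k
        $\delta_k = o_k (1 - o_k)(t_k - o_k)$
      For each hidden unit h
        $\delta_h = o_h (1 - o_h) \sum_k w_{h,k} \delta_k$
      For each network weight $w_{i,j}$, Do
        $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}$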
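The gradient formula for a single sigmoid unit can be sanity-checked against finite differences. The sketch below is illustrative only; the data set, the weight vector and the step size h are arbitrary assumptions, and each input vector carries the constant $x_0 = 1$ as its first component.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def unit_output(w, x):
    """o = sigma(sum_i w_i * x_i); x[0] is the constant input 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, data):
    """E = 1/2 * sum_d (t_d - o_d)^2."""
    return 0.5 * sum((t - unit_output(w, x)) ** 2 for x, t in data)

def analytic_gradient(w, data):
    """dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_id."""
    grad = [0.0] * len(w)
    for x, t in data:
        o = unit_output(w, x)
        for i, xi in enumerate(x):
            grad[i] += -(t - o) * o * (1 - o) * xi
    return grad

def numeric_gradient(w, data, h=1e-6):
    """Central finite differences, used only as a check."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        grad.append((error(up, data) - error(down, data)) / (2 * h))
    return grad

# Hypothetical data: the OR function with 0/1 targets.
data = [((1.0, 0.0, 0.0), 0.0), ((1.0, 0.0, 1.0), 1.0),
        ((1.0, 1.0, 0.0), 1.0), ((1.0, 1.0, 1.0), 1.0)]
w = [0.1, -0.2, 0.3]
print(analytic_gradient(w, data))
print(numeric_gradient(w, data))
```

The two gradients should agree to several decimal places. The same kind of check is a common way to debug a full backpropagation implementation.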

38 Backpropagation
  Gradient descent over the entire network weight vector
  Easily generalized to arbitrary directed graphs
  Will find a local, not necessarily global, error minimum
    In practice it often works well (can be invoked multiple times with different initial weights)
  Often includes a weight momentum term

  $\Delta w_{i,j}(n) = \eta \, \delta_j \, x_{i,j} + \alpha \, \Delta w_{i,j}(n-1)$

  Minimizes the error over the training examples
    Will it generalize well to unseen instances (over-fitting)?
  Training can be slow (the typical iteration count did not survive extraction); use Levenberg-Marquardt instead of gradient descent
  Using the network after training is fast

Binary Encoder-Decoder
  8 inputs, 3 hidden units, 8 outputs
  [Table of learned hidden values not preserved in extraction.]

39 [Figure: Sum of Squared Errors for the Output Units.]

[Figure: Hidden Unit Encoding for Input. The input pattern named in the title did not survive extraction.]

40 Convergence of Backprop
  Gradient descent to some local minimum
    Perhaps not the global minimum
    Add momentum
    Stochastic gradient descent
    Train multiple nets with different initial weights
  Nature of convergence
    Initialize weights near zero
    Therefore, initial networks are near-linear
    Increasingly non-linear functions become possible as training progresses

Expressive Capabilities of ANN
  Boolean functions
    Every boolean function can be represented by a network with a single hidden layer
    But this might require a number of hidden units exponential in the number of inputs
  Continuous functions
    Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
    Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

41 What is Cluster Analysis?
  Cluster: a collection of data objects
    Similar to one another within the same cluster
    Dissimilar to the objects in other clusters
  Cluster analysis
    Grouping a set of data objects into clusters
  Clustering is unsupervised classification: no predefined classes
  Typical applications
    As a stand-alone tool to get insight into the data distribution
    As a preprocessing step for other algorithms

Examples of Clustering Applications in Bioinformatics
  MEME, a motif discovery algorithm using EM
    Clusters patterns (base sequences) based on their similarity
  COGs, Clusters of Orthologous Groups
    Identifies commonality between genes
  GeneShaving, clustering gene expression data
    Given a gene expression matrix of size N × p, similar expression patterns are clustered

42 Microarrays
  Rows represent genes
  Columns represent samples
  Many problems may be solved using clustering

Example of Microarray Dataset
  [Figure: microarray data matrix.]
  Row $G_i$: expression levels of gene i, across samples
  Column $S_j$: expression levels of all genes, for one sample
  Typical examples of samples: heat shock, phases in the cell cycle, cancer, normal, ...

43 Things to Study (1)
  Clustering (grouping) genes, i.e., finding groups of co-regulated genes
  Groups of similar behaviour?
  Example: expression levels across time of two clusters of co-regulated genes
  [Figure: two plots of expression level against samples, one per cluster.]

Things to Study (2)
  Clustering (grouping) samples, i.e., finding groups of samples with similar genetic profiles (e.g., cancer types)
  Groups of similar behaviour?

44 Things to Study (3)
  Classifying genes, i.e., deciding if a gene is co-regulated with some known gene(s), based on their expression profiles across samples
  [Figure: expression profiles across samples of annotated gene 1, annotated gene 2 and an unknown gene.]
  Co-regulation? Similar biological function? Same transcription factor?

Things to Study (4)
  Classifying samples, i.e., classifying new samples based on a set of classified samples (examples: cancer versus normal; different types of cancer; ...)
  [Figure: classified samples of classes A and B, next to the samples to be classified.]

45 Things to Study (5)
  Selecting genes:
    a) Deciding if a given gene, in isolation, behaves differently in a control versus experimental situation (e.g., cancer vs normal, two types of cancer, treatment vs non-treatment)
    b) Selecting which group of genes is significantly different in a control versus experimental situation (same examples)
    c) Selecting which group of genes is relevant for a given classification problem

Hierarchical (Agglomerative) Clustering
  Strictly speaking, agglomerative clustering does not produce clusters, but a dendrogram
  [Figure: dendrogram with dissimilarity on the vertical axis.]
  Cutting the dendrogram at a certain level yields clusters
  Dendrogram cutting is a problem analogous to the selection of K in K-means clustering

46 Example of Agglomerative Gene Clustering (Eisen et al, 98)
  Microarray data from a time course of serum stimulation of primary human fibroblasts
  Experiment: foreskin fibroblasts were grown in culture and were deprived of serum for 48 hr. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr.
  Clustering: agglomerative clustering with the correlation coefficient as similarity measure (average-link)
  Clusters with biological interpretation: (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signaling and angiogenesis, (E) wound healing and tissue remodelling

Data Structures
  Data matrix

  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

  Dissimilarity matrix

  $\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$

47 Partitioning Algorithms: Basic Concept
  Partitioning method: construct a partition of a database D of n objects into a set of k clusters
  Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
    Global optimum: exhaustively enumerate all partitions
    Heuristic methods: k-means and k-medoids algorithms
    k-means (MacQueen 67): each cluster is represented by the center of the cluster
    k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw 87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
  Given k, the k-means algorithm is implemented in four steps:
    1. Partition the objects into k nonempty subsets
    2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
    3. Assign each object to the cluster with the nearest seed point
    4. Go back to Step 2; stop when there are no more new assignments
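The assign-and-update loop can be sketched in plain Python. This is a minimal illustration with hypothetical names and data; instead of starting from an initial partition, it starts from k objects chosen as initial centers, which is a common equivalent initialisation.

```python
def mean(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(points, centers, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:   # no more new assignments: stop
            break
        assignment = new_assignment
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep the old center if a cluster empties
                centers[j] = mean(members)
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, centers=points[:2])
print(centers, assignment)
```

Even though both initial centers lie in the same group, the loop separates the two groups after a couple of iterations. With less well-separated data, a poor initialisation can leave the algorithm in a local optimum.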

48 The K-Means Clustering Method
  Example (K = 2)
  [Figure: sequence of scatter plots. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]

Comments on the K-Means Method
  Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
    Comparing: PAM: $O(k(n-k)^2)$, CLARA: $O(ks^2 + k(n-k))$
  Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
  Weakness
    Applicable only when the mean is defined; what about categorical data?
    Need to specify k, the number of clusters, in advance
    Unable to handle noisy data and outliers
    Not suitable for discovering clusters with non-convex shapes

49 Variations of the K-Means Method
  A few variants of k-means differ in
    Selection of the initial k means
    Dissimilarity calculations
    Strategies to calculate cluster means
  Handling categorical data: k-modes (Huang 98)
    Replacing the means of clusters with modes
    Using new dissimilarity measures to deal with categorical objects
    Using a frequency-based method to update the modes of clusters
    A mixture of categorical and numerical data: k-prototype method

What is the Problem of the k-Means Method?
  The k-means algorithm is sensitive to outliers!
    An object with an extremely large value may substantially distort the distribution of the data.
  K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

50 The K-Medoids Clustering Method
  Find representative objects, called medoids, in clusters
  PAM (Partitioning Around Medoids, 1987)
    Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
    PAM works effectively for small data sets, but does not scale well for large data sets

Typical k-Medoids Algorithm (PAM)
  [Figure: sequence of scatter plots for K = 2, annotated with the total cost of each configuration (one value is 26; the other did not survive extraction).]
    Arbitrarily choose k objects as the initial medoids
    Assign each remaining object to the nearest medoid
    Do loop, until no change:
      Randomly select a non-medoid object, $O_{random}$
      Compute the total cost of swapping
      Swap O and $O_{random}$ if the quality is improved

51 PAM (Partitioning Around Medoids) (1987)
  PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
  Uses real objects to represent the clusters
    1. Select k representative objects arbitrarily
    2. For each pair of non-selected object h and selected object i, calculate the total swapping cost $TC_{ih}$
    3. For each pair of i and h,
       If $TC_{ih} < 0$, i is replaced by h
       Then assign each non-selected object to the most similar representative object
    4. Repeat steps 2-3 until there is no change

What is the Problem with PAM?
  PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
  PAM works efficiently for small data sets but does not scale well for large data sets
    $O(k(n-k)^2)$ for each iteration, where n is # of data objects and k is # of clusters
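The swap loop can be sketched as follows. This is a simplified illustration, not the S-Plus implementation: the function names and the one-dimensional data are hypothetical, and any swap that lowers the total cost is accepted immediately, whereas classical PAM evaluates all (i, h) pairs before swapping.

```python
def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    """Greedy medoid swapping until no swap lowers the total cost."""
    medoids = list(points[:k])            # arbitrary initial choice
    improved = True
    while improved:
        improved = False
        best_cost = total_cost(points, medoids, dist)
        for i in range(len(medoids)):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best_cost:      # swapping cost TC_ih < 0
                    medoids, best_cost, improved = candidate, cost, True
    return medoids

points = [1, 2, 3, 11, 12, 13, 25]        # 25 acts as a mild outlier
medoids = pam(points, k=2, dist=lambda a, b: abs(a - b))
print(sorted(medoids), total_cost(points, medoids, lambda a, b: abs(a - b)))
```

In this run the medoid of the upper group stays at 12, an actual data object in the middle of the dense part, whereas the mean of {11, 12, 13, 25} is 15.25 and is pulled towards the outlier. Each pass evaluates k(n-k) candidate swaps at a cost proportional to n each, which illustrates why the method does not scale to large data sets.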
