Data Analysis. Santiago González


1 Santiago González

2 Contents Introduction CRISP-DM (1) Tools Data understanding Data preparation Modeling (2) Association rules? Supervised classification Clustering Assessment & Evaluation (1) Examples: (2) Neuron Classification Alzheimer disease Medulloblastoma CliDaPa (1) Special Guest Prof. Ernestina Menasalvas Stream Mining

3 Data Mining: Modeling

4 Data Mining Tasks Prediction Methods: Use some variables to predict unknown or future values of other variables. Description Methods: Find human-interpretable patterns that describe the data. From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

5 Data Mining Tasks... Association Rule Discovery [Descriptive] Classification [Predictive] Regression [Predictive] Clustering [Descriptive] Supervised cl. Unsupervised cl.

6 Data Mining Tasks... Association Rule Discovery [Descriptive] Classification [Predictive] Regression [Predictive] Clustering [Descriptive]

7 Association Rule Discovery Given a set of records, each of which contains some number of items from a given collection, produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
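The slide does not say how a rule is scored. A common choice, assumed here, is support (fraction of records containing all items of the rule) and confidence (fraction of records containing the antecedent that also contain the consequent). The helper names `support` and `confidence` are illustrative, not part of the original deck.

```python
# The five transactions from the slide, one set of items per record.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, records):
    """Fraction of records that contain every item of `itemset`."""
    return sum(itemset <= record for record in records) / len(records)


def confidence(antecedent, consequent, records):
    """Estimate of P(consequent | antecedent) from the records."""
    return support(antecedent | consequent, records) / support(antecedent, records)


milk_coke = confidence({"Milk"}, {"Coke"}, transactions)
diaper_milk_beer = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(milk_coke, diaper_milk_beer)
```

On this table, {Milk} --> {Coke} holds in 3 of the 4 records that contain Milk, and {Diaper, Milk} --> {Beer} in 2 of the 3 records that contain both Diaper and Milk.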

8 Association Rule Discovery Example: Let the rule discovered be {Bagels, } --> {Potato Chips} Potato Chips as consequent => Can be used to determine what should be done to boost its sales. Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels. Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

9 Data Mining Tasks... Association Rule Discovery [Descriptive] Classification [Predictive] Regression [Predictive] Clustering [Descriptive]

10 Classification: Definition Given a collection of records (training set). Each record contains a set of attributes; one of the attributes is the class (categorical). The class may be binary or not. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A testing set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

11 Classification Example

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model

12 Classifying Galaxies Class: Stages of Formation (Early, Intermediate, Late) Attributes: Image features, Characteristics of light waves received, etc. Data Size: 72 million stars, 20 million galaxies Object Catalog: 9 GB Image Database: 150 GB Courtesy: http://aps.umn.

13 Classification

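14 Cross validation With a = true positives, b = false positives, c = false negatives, d = true negatives and sum = a+b+c+d: Well classified: (a+d)/sum Wrong classified: (c+b)/sum True positive rate (sensitivity): a/(a+c) True negative rate (specificity): d/(b+d) False positive rate: b/(b+d) False negative rate: c/(a+c)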

15 Classification: example Well classified: Wrong classified: True positive (sensitivity): True negative (specificity): False positive: False negative:

16 Classification: example Well classified: 4/6 Wrong classified: 2/6 True positive (sensitivity): 2/3 True negative (specificity): 2/3 False positive: 1/3 False negative: 1/3
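The confusion matrix behind this example is not preserved in the text, so the sketch below assumes the counts a = 2 true positives, b = 1 false positive, c = 1 false negative and d = 2 true negatives, which reproduce the six results on the slide. The false positive and false negative rates use the usual definitions b/(b+d) and c/(a+c); the function name is illustrative.

```python
def confusion_metrics(a, b, c, d):
    """Rates derived from a 2x2 confusion matrix.

    a: true positives, b: false positives,
    c: false negatives, d: true negatives.
    """
    total = a + b + c + d
    return {
        "well_classified": (a + d) / total,
        "wrong_classified": (b + c) / total,
        "sensitivity": a / (a + c),          # true positive rate
        "specificity": d / (b + d),          # true negative rate
        "false_positive_rate": b / (b + d),
        "false_negative_rate": c / (a + c),
    }


# Assumed counts that are consistent with the results on the slide.
metrics = confusion_metrics(a=2, b=1, c=1, d=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```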

17 Classification

18 KNN Idea: use the information of the k nearest neighbours. We need to calculate the distance between samples in order to know which ones are nearest (Euclidean, Manhattan, etc.) Prior info: Number of neighbours: K Distance function: d(x,y) Learning data Testing data

19 KNN Euclidean distance: $d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}$ Manhattan distance: $d(x,y) = \sum_i |x_i - y_i|$ Quite similar. Difference: absolute value instead of squared value

20 KNN Example with K = 3, two attributes and euclidean distance
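The figure for this example is not available in the text, so the following is a minimal sketch of KNN with K = 3, two attributes and Euclidean distance on made-up points. The training data, labels and function names are assumptions for illustration only.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def knn_predict(train, query, k=3, distance=euclidean):
    """Majority class among the k training samples closest to `query`."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical learning data: (attributes, class).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.0), "B"),
]

prediction = knn_predict(train, (2.0, 2.0), k=3)
print(prediction)
```

Passing `distance=manhattan` switches the metric without touching the voting step.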

21 ID3 Objective: Create a decision tree as a method to approximate a target function based on discrete values. Resistant to noise in the data. Is able to learn a disjunction of expressions. The result can be expressed as if-then rules. Tries to find the simplest tree that best separates the samples. It is a recursive algorithm. Uses information gain.

22 ID3

23 ID3 The most discriminative feature is the one with the highest Information Gain:
$$G(C, Attr_1) = E(C) - \sum_i P(Attr_1 = V_i)\, E(Attr_1 = V_i)$$
where
$$E(C) = -\sum_c P(C = c) \log_2 P(C = c)$$
$$E(Attr_1 = V_i) = -\sum_c P(C = c \mid Attr_1 = V_i) \log_2 P(C = c \mid Attr_1 = V_i)$$
and $\log_2(p) = \ln(p) / \ln(2)$.

24 ID3: example Is this feature important? (Supervised Classification)

25 ID3: example G(AdministerTreatment, Gout) = G(AT, G)
$$G(AT, G) = E(AT) - P(G = Yes)\, E(G = Yes) - P(G = No)\, E(G = No)$$
$$E(G = Yes) = -P(AT = Yes \mid G = Yes) \log_2 P(AT = Yes \mid G = Yes) - P(AT = No \mid G = Yes) \log_2 P(AT = No \mid G = Yes) = -\tfrac{3}{7} \log_2 \tfrac{3}{7} - \tfrac{4}{7} \log_2 \tfrac{4}{7} = 0.985$$
$$E(G = No) = -P(AT = Yes \mid G = No) \log_2 P(AT = Yes \mid G = No) - P(AT = No \mid G = No) \log_2 P(AT = No \mid G = No) = -\tfrac{6}{7} \log_2 \tfrac{6}{7} - \tfrac{1}{7} \log_2 \tfrac{1}{7} = 0.592$$
$$E(AT) = -P(AT = Yes) \log_2 P(AT = Yes) - P(AT = No) \log_2 P(AT = No) = -\tfrac{9}{14} \log_2 \tfrac{9}{14} - \tfrac{5}{14} \log_2 \tfrac{5}{14} = 0.940$$
$$G(AT, G) = 0.94 - \tfrac{7}{14} \times 0.985 - \tfrac{7}{14} \times 0.592 \approx 0.151$$
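The arithmetic of this slide can be checked with a short sketch. It takes only the class counts that appear in the fractions above: 9 Yes / 5 No overall, 3 Yes / 4 No when Gout = Yes, and 6 Yes / 1 No when Gout = No. The function names are illustrative.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(class_counts, partitions):
    """Entropy of the class minus the weighted entropy of each partition."""
    total = sum(class_counts)
    remainder = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(class_counts) - remainder


# [Yes, No] counts of the class, overall and per value of the attribute.
gain = information_gain([9, 5], [[3, 4], [6, 1]])
print(round(entropy([9, 5]), 3), round(gain, 3))
```

Without rounding the intermediate entropies the gain comes out near 0.152; the 0.151 on the slide results from using the rounded values 0.94, 0.985 and 0.592.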

26 ID3: example

27 ID3: example

28 ID3: example

29 Bayes Classifier A probabilistic framework for solving classification problems. Conditional Probability:
$$P(C \mid A) = \frac{P(A, C)}{P(A)} \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$
Bayes theorem:
$$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$$

30 Example of Bayes Theorem Given: A doctor knows that meningitis causes stiff neck 50% of the time. Prior probability of any patient having meningitis is 1/50,000. Prior probability of any patient having stiff neck is 1/20. If a patient has stiff neck, what's the probability he/she has meningitis?
$$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
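As a quick numerical check of the meningitis example, the sketch below plugs the three given probabilities into Bayes theorem. The function and variable names are illustrative.

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(C | A) = P(A | C) * P(C) / P(A)."""
    return likelihood * prior / evidence


p_meningitis_given_stiff_neck = posterior(
    likelihood=0.5,        # P(S | M): meningitis causes stiff neck 50% of the time
    prior=1 / 50_000,      # P(M)
    evidence=1 / 20,       # P(S)
)
print(p_meningitis_given_stiff_neck)
```

Even with a stiff neck, the probability of meningitis stays at about 2 in 10,000 because the prior is so small.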

31 Bayesian Classifiers Consider each attribute and class label as random variables. Given a record with attributes $(A_1, A_2, \dots, A_n)$, the goal is to predict class C. Specifically, we want to find the value of C that maximizes $P(C \mid A_1, A_2, \dots, A_n)$. Can we estimate $P(C \mid A_1, A_2, \dots, A_n)$ directly from data?

32 Bayesian Classifiers Approach: compute the posterior probability $P(C \mid A_1, A_2, \dots, A_n)$ for all values of C using the Bayes theorem:
$$P(C \mid A_1 A_2 \dots A_n) = \frac{P(A_1 A_2 \dots A_n \mid C)\, P(C)}{P(A_1 A_2 \dots A_n)}$$
Choose the value of C that maximizes $P(C \mid A_1, A_2, \dots, A_n)$. Equivalent to choosing the value of C that maximizes $P(A_1, A_2, \dots, A_n \mid C)\, P(C)$. How to estimate $P(A_1, A_2, \dots, A_n \mid C)$?

33 Naïve Bayes Classifier Assume independence among attributes $A_i$ when the class is given:
$$P(A_1, A_2, \dots, A_n \mid C_j) = P(A_1 \mid C_j)\, P(A_2 \mid C_j) \cdots P(A_n \mid C_j)$$
Can estimate $P(A_i \mid C_j)$ for all $A_i$ and $C_j$. A new point is classified to $C_j$ if $P(C_j) \prod_i P(A_i \mid C_j)$ is maximal.

34 How to Estimate Probabilities from Data?

Tid  Refund (categorical)  Marital Status (categorical)  Taxable Income (continuous)  Evade (class)
1    Yes                   Single                        125K                         No
2    No                    Married                       100K                         No
3    No                    Single                        70K                          No
4    Yes                   Married                       120K                         No
5    No                    Divorced                      95K                          Yes
6    No                    Married                       60K                          No
7    Yes                   Divorced                      220K                         No
8    No                    Single                        85K                          Yes
9    No                    Married                       75K                          No
10   No                    Single                        90K                          Yes

Class: $P(C) = N_c / N$, e.g., P(No) = 7/10, P(Yes) = 3/10
For discrete attributes: $P(A_i \mid C_k) = |A_{ik}| / N_{c_k}$, where $|A_{ik}|$ is the number of instances having attribute $A_i$ and belonging to class $C_k$
Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0

35 How to Estimate Probabilities from Data? For continuous attributes: (1) Discretize the range into bins: one ordinal attribute per bin; violates the independence assumption. (2) Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute. (3) Probability density estimation: assume the attribute follows a normal distribution; use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, it can be used to estimate the conditional probability $P(A_i \mid c)$.

36 How to Estimate Probabilities from Data? (Training data: the ten-record Refund / Marital Status / Taxable Income / Evade table of slide 34.) Normal distribution, one for each $(A_i, c_j)$ pair:
$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}}\, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$
For (Income, Class=No): if Class=No, sample mean = 110 and sample variance = 2975, so
$$P(Income = 120 \mid No) = \frac{1}{\sqrt{2\pi}\,(54.54)}\, e^{-\frac{(120 - 110)^2}{2(2975)}} = 0.0072$$

37 Example of Naïve Bayes Classifier Given a Test Record: X = (Refund = No, Married, Income = 120K)
Naive Bayes classifier:
P(Refund=Yes | No) = 3/7
P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0
P(Refund=No | Yes) = 1
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/3
P(Marital Status=Divorced | Yes) = 1/3
P(Marital Status=Married | Yes) = 0
For taxable income: if class=No: sample mean = 110, sample variance = 2975; if class=Yes: sample mean = 90, sample variance = 25
P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes) = 1 × 0 × 1.2 × 10^-9 = 0
Since P(X | No) P(No) > P(X | Yes) P(Yes), therefore P(No | X) > P(Yes | X) => Class = No
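The example can be reproduced with a small sketch over the ten training records: categorical attributes use relative frequencies within the class, and taxable income uses a normal density with the per-class sample mean and sample variance. The data layout and function names are assumptions made for illustration.

```python
import math
from statistics import mean, variance

# (Refund, Marital Status, Taxable Income in K, Evade) from the training table.
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]


def gaussian(x, mu, var):
    """Normal density with mean `mu` and variance `var`."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def score(refund, status, income, label):
    """P(X | label) * P(label) under the naive independence assumption."""
    rows = [r for r in records if r[3] == label]
    prior = len(rows) / len(records)
    p_refund = sum(r[0] == refund for r in rows) / len(rows)
    p_status = sum(r[1] == status for r in rows) / len(rows)
    incomes = [r[2] for r in rows]
    # statistics.variance is the sample variance (divides by n - 1).
    p_income = gaussian(income, mean(incomes), variance(incomes))
    return prior * p_refund * p_status * p_income


scores = {label: score("No", "Married", 120, label) for label in ("No", "Yes")}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

The score for Yes is exactly zero because no training record of that class is Married, which is the problem the smoothed estimates on the next slide address.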

38 Naïve Bayes Classifier If one of the conditional probabilities is zero, then the entire expression becomes zero. Probability estimation:
$$\text{Original: } P(A_i \mid C) = \frac{N_{ic}}{N_c}$$
$$\text{Laplace: } P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$$
$$\text{m-estimate: } P(A_i \mid C) = \frac{N_{ic} + m p}{N_c + m}$$
c: number of classes, p: prior probability, m: parameter
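To see the effect of the three estimators, the sketch below applies them to P(Refund=Yes | Yes) from the earlier table, where $N_{ic} = 0$ and $N_c = 3$. The values m = 3 and p = 0.5 are arbitrary choices for illustration, and c = 2 follows the slide's definition of c as the number of classes.

```python
def original_estimate(n_ic, n_c):
    """Plain relative frequency N_ic / N_c."""
    return n_ic / n_c


def laplace_estimate(n_ic, n_c, c):
    """Laplace estimate (N_ic + 1) / (N_c + c)."""
    return (n_ic + 1) / (n_c + c)


def m_estimate(n_ic, n_c, m, p):
    """m-estimate (N_ic + m * p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)


# P(Refund=Yes | Yes): 0 of the 3 records of class Yes have Refund=Yes.
plain = original_estimate(0, 3)
laplace = laplace_estimate(0, 3, c=2)
smoothed = m_estimate(0, 3, m=3, p=0.5)   # m and p are hypothetical
print(plain, laplace, smoothed)
```

Both smoothed estimates are strictly positive, so a single unseen attribute value no longer forces the whole product to zero.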

39 Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?
A: attributes, M: mammals, N: non-mammals
$$P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$$
$$P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$$
$$P(A \mid M)\, P(M) = 0.06 \times \frac{7}{20} = 0.021$$
$$P(A \mid N)\, P(N) = 0.0042 \times \frac{13}{20} = 0.0027$$
P(A | M) P(M) > P(A | N) P(N) => Mammals

40 Data Mining Tasks... Association Rule Discovery [Descriptive] Classification [Predictive] Regression [Predictive] Clustering [Descriptive]

41 Regression Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Extensively studied in the statistics and neural network fields. Examples: Predicting sales amounts of a new product based on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices.
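As one concrete instance of a linear model of dependency, here is a minimal ordinary least squares fit of a straight line with a single predictor. The advertising and sales figures are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept   # prediction for a new expenditure
print(slope, intercept, predicted_sales)
```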

42 Regression

43 Data Mining Tasks... Association Rule Discovery [Descriptive] Classification [Predictive] Regression [Predictive] Clustering [Descriptive]

44 Clustering Definition A clustering is a set of clusters Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures.

45 Illustrating Clustering Intracluster distances are minimized Intercluster distances are maximized Euclidean Distance Based Clustering in 3-D space.

46 Cluster can be Ambiguous How many clusters? Six Clusters Two Clusters Four Clusters

47 Clustering

48 Types of Clusterings Important distinction between hierarchical, partitional and density-based sets of clusters. Partitional Clustering (K-Means): A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering (Agglomerative): A set of nested clusters organized as a hierarchical tree. Density clustering (DBSCAN): Clusters are regarded as regions in the data space in which the objects are dense, and which are separated by regions of low object density (noise).

49 Partitional Clustering Original Points A Partitional Clustering

50 K-Means Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple

51 K-Means Initial centroids are often chosen randomly. Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. Closeness is usually measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above. Most of the convergence happens in the first few iterations. Often the stopping condition is changed to "until relatively few points change clusters".
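The basic algorithm can be sketched in a few lines: assign each point to its closest centroid, recompute each centroid as the mean of its points, and repeat until no point changes cluster. This sketch uses Euclidean distance and takes the initial centroids as an argument rather than drawing them randomly, so that the run is reproducible; the data points are made up.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-means; returns (cluster index per point, final centroids)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
assignment, centroids = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(assignment, centroids)
```

Starting from different initial centroids can lead to a different final partition, which is why the choice of initial centroids matters.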

52 Importance of Choosing Initial Centroids [Figure: scatter plot of the clusters at one iteration; axes x, y]

53 Importance of Choosing Initial Centroids [Figure: six scatter-plot panels, Iteration 1 to Iteration 6; axes x, y]

54 Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits

55 Hierarchical Clustering [Figure: four panels over the points p1, p2, p3, p4: Traditional Hierarchical Clustering, Traditional Dendrogram, Non-traditional Hierarchical Clustering, Non-traditional Dendrogram]

56 DBSCAN Original Points Clusters Resistant to Noise Can handle clusters of different shapes and sizes

57 Data Mining: Assessment

58 Assessment Algorithms Supervised Metrics Validation Algorithms Unsupervised Metrics

59 Supervised validation alg. Resubstitution

60 Supervised validation alg. Hold-out

61 Supervised validation alg. N-fold cross validation

62 Supervised validation alg. Leave-one-out (N max folds) N-fold cross validation when N = dim(data)
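The relation between N-fold cross validation and leave-one-out can be made concrete with a sketch that only generates the train/test index splits. It assumes contiguous folds without shuffling, and `k_fold_splits` is a hypothetical helper, not a library function.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


five_fold = list(k_fold_splits(10, 5))       # 5-fold cross validation
leave_one_out = list(k_fold_splits(10, 10))  # N folds with N = number of samples
print(five_fold[0])
print(leave_one_out[0])
```

Every sample is used for testing exactly once; with as many folds as samples, each test set has a single element, which is leave-one-out.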

63 Supervised validation alg. Bootstrap

64 Supervised metrics Calibration: Distance between the real class and the predicted class. Continuous [0, ∞) Discrimination: Probability of classification. Continuous [0,1] In classification, we want to get the lowest calibration possible and the highest discrimination possible.

65 Supervised metrics Example: Real class: 1 Predicted class: 0.6 (using regression) Discrimination: 1, supposing that if Class predicted > 0.5 then Class predicted = 1 Calibration: 0.4 (1 - 0.6)

66 Supervised metrics Accuracy (well classified) [Discrimination] Log Likelihood [Calibration] AUC [Discrimination] Brier Score [Calibration + Discrimination] Hosmer DW, Lemeshow S (2000) Applied logistic regression 2nd edn. Wiley, New York

67 AUC Area Under the ROC Curve Continuous [0,1]
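One way to compute the AUC without drawing the curve, used in this sketch, is its rank interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as one half. The labels and scores below are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class, 0 for the negative class.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
value = auc(labels, scores)
print(value)
```

A value of 1 means every positive is ranked above every negative, and 0.5 corresponds to a random ranking.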

68 Unsupervised validation

69 Unsupervised alg. Compactness, the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized. Separation, the clusters themselves should be widely spaced. There are three common approaches measuring the distance between two different clusters: Single linkage: It measures the distance between the closest members of the clusters. Complete linkage: It measures the distance between the most distant members. Comparison of centroids: It measures the distance between the centers of the clusters. MARIA HALKIDI, YANNIS BATISTAKIS and MICHALIS VAZIRGIANNIS On Clustering Validation Techniques, Journal of IIS, 2001
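The compactness and separation measures described above can be sketched directly, assuming Euclidean distance and clusters given as lists of coordinate tuples. All function names and the two toy clusters are illustrative.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def cluster_variance(cluster):
    """Compactness: mean squared distance of the members to the centroid."""
    center = centroid(cluster)
    return sum(math.dist(p, center) ** 2 for p in cluster) / len(cluster)


def single_linkage(a, b):
    """Distance between the closest members of the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Distance between the most distant members of the two clusters."""
    return max(math.dist(p, q) for p in a for q in b)


def centroid_distance(a, b):
    """Distance between the centers of the two clusters."""
    return math.dist(centroid(a), centroid(b))


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
print(cluster_variance(cluster_a))
print(single_linkage(cluster_a, cluster_b),
      complete_linkage(cluster_a, cluster_b),
      centroid_distance(cluster_a, cluster_b))
```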

70 Measures of Cluster Validity Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types. External Index: Used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy. Internal Index: Used to measure the goodness of a clustering structure without respect to external information. Example: Sum of Squared Error (SSE). Relative Index: Used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy. MARIA HALKIDI, YANNIS BATISTAKIS and MICHALIS VAZIRGIANNIS On Clustering Validation Techniques, Journal of IIS, 2001

71 Using Similarity Matrix for Cluster Validation Order the similarity matrix with respect to cluster labels and inspect visually. [Figure: scatter plot of the points (axes x, y) next to the ordered similarity matrix (axes Points, Points; colour scale Similarity); Complete Link]

72 Using Similarity Matrix for Cluster Validation Clusters in random data are not so crisp. [Figure: scatter plot of the points (axes x, y) next to the ordered similarity matrix (axes Points, Points; colour scale Similarity); Complete Link]

73 Using Similarity Matrix for Cluster Validation DBSCAN

74 Santiago González
