n Classify what types of customers buy what products n What makes a pattern interesting? n easily understood

Size: px

Start display at page:

Download "n Classify what types of customers buy what products n What makes a pattern interesting? n easily understood"

Barnard Wilkins
5 years ago
Views:

1 Data Mining/Knowledge Discovery/ Intelligent Data Analysis Task: discovering interesting patterns or structures in large amounts of data. Multidisciplinary: machine learning, statistics and databases Algorithms: construct particular types of models or patterns, rely on a built-in bias to simpler or more explanatory models, are designed to be efficient, require interestingness or evaluation criteria to separate pattern from noise. Applications Marketing Market basket associations between product sales, e.g., beer and diapers Classify what types of customers buy what products Auto Insurance Fraud Detect people who stage accidents to collect on insurance Money Laundering Since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions Spam filtering, flu or earthquake detection, ad tailoring, social trends, recommendations, uncovering secrets Key Issue: Feedback Supervised: ground truth or correct answer is provided by a teacher Reinforcement: program receives some evaluation, but is not told the correct answer Unsupervised: correct answer must be inferred, no hints Key Issue: Interestingness What makes a pattern interesting? easily understood valid on new data within some certainty useful novel Measures statistical significance or frequency recall vs. precision (how well are data characterized?) model specific 1

2 Assessing Performance: Separating Training and Testing The set of examples is divided into two sets: training and testing. Varying the size of randomly selected training/testing sets produces a learning curve. 3 rd set for validation: used to optimize parameters or select model prior to testing Assessing Performance: Cross-Validation addresses overfitting when model matches data too closely to permit generalizations. K-fold Cross-validation (aka Leave-One-Out ) 1. Shuffle items in the problem set. 2. Divide set into k equal sized parts (aka folds ) 3. Do i=1 to k times: 1. Call the ith set the test set and put aside. 2. Train the system on the other k-1 sets; test on the ith set and record performance 3. Reset 4. Calculate average performance for the k tests Assessing Performance: Stratification Cross validation random folds assume the data includes all of the classes. But for rare classes, can get unlucky. Stratification is when folds are selected so that each class is represented in both training and testing Assessing Performance: Bootstrap Sampling with replacement Good for small datasets bootstrap (approximate % of instances that make it into the training set) 1. Collect n samples from dataset of size n with replacement which becomes the training set 2. Instances not sampled become the testing set 2

3 Data Mining Models Classifiers Supervised: training data includes known class 1R, Decision Trees, Naïve Bayes Unsupervised: training data do not have correct answer Clusters Patterns in Logs: Association rules Sequence detection Classification: Generalizing from Examples Given a collection of examples (input and classification pairs), return a function that approximates the true classification. Typically, data are vectors of feature values (e.g., Vector model of terms in Information Retrieval). Linear Spline Alternate Classification Models Bias is any preference for one hypothesis over another beyond mere consistency of examples. Hard part is determining what makes a good classification representation. Ockham s razor: the most likely hypothesis is the simplest one that is consistent with the observations. Supervised I: Nearest-Neighbors Select a classification for a new point based on the labels for the k nearest neighbors For discrete, take majority vote. For continuous, average, do linear or weighted regression. Issues: How to pick k? How to compute distances? 3

Nearest-Neighbors Example Making Nearest Neighbor Efficient 50 45 40 35 30 25 20 15 10 5 0 0 10 20 30 40 50 Obvious method: linear scan of data Efficient data structures kd-trees: binary tree that

4 Nearest-Neighbors Example Making Nearest Neighbor Efficient Obvious method: linear scan of data Efficient data structures kd-trees: binary tree that recursively constructs hyperplane partitions. It stores points in k dimensional space where k is number of attributes. Ball trees: defines k-dimensional hyperspheres, arranged in a tree Making Nearest Neighbor Efficient: example kd-trees Supervised II: 1R (7,4) horizontal a2 (3,8) (6,7) Rule on one attribute Most frequent class for each value of the attribute (4,2) vertical (6,7) vertical (7,4) Choose attribute with the lowest error rate: % of instances that don t belong to most frequent class for the value (2,2) (3,8) (2,2) (4,2) Adapted from Witten et al. a1 4

5 Example Data # Hair Height Weight Lotion Sunburn? 1 blonde average light no yes 2 blonde tall average yes no 3 brown short average yes no 4 blonde short average no yes 5 red average heavy no yes 6 brown tall heavy no no 7 brown average heavy no yes 8 blonde short light yes no 1R on example data Attribute Rules Errors Total Errors Hair Blonde è yes 2/4 3/8 Brown è no 1/3 Red è yes 0/1 Height Short è no 1/3 1/8 Average è yes 0/3 Tall è no 0/2 Weight Light è yes 1/2 3/8 Average è no 1/3 Heavy è yes 1/3 Lotion Yes è no 0/3 1/8 No è yes 1/5 1R Extension Hyperpipes Procedure: Construct one rule per class: conjunction of tests for subset of values on each attribute Prediction is the class returned by the most rules Simple form of voting/ensemble Supervised III: Decision Trees Models a discrete valued function Represent concepts as a series of decisions (answers to tests) Internal nodes are tests with branches to different answers Leaf nodes are the classification for the data element 5

6 Decision Trees, basic idea Example Decision Tree: Mushroom Classification Y if X > 5 then yellow else if Y > 3 then yellow else if X > 2 then blue else yellow Yellow? Large? Spotted? 3 poisonous edible Big spots? poisonous 2 5 X Figure/example taken from by Piatetsky & Parker edible poisonous Lopsided Binary Tree for Classifying Cards Using Information Theory to Choose What to Ask When yes Ace? yes 7? no 8? 9? Card Target? Α yes 7 yes 8 no 9 yes Can we characterize how much it is worth to get an answer to a question? Given a probability distribution, the information required to predict an event is the distribution s entropy One bit of information is enough to answer a yes or no question General case: Entropy is Information n I(P(v 1 ),...,P(v n )) = P(v i )log 2 P(v i ) i=1 yes no Boolean case: I(P(t),P( f )) = P(t)log 2 P(t) P( f )log 2 P( f ) 6

7 Relate this to Coin Tossing Consider information in tossing a fair coin. ' 1 1 $ I%, " = log log2 & 2 2 # = 1bit Consider an unfair coin, heads 1 in 100 tosses. ' 1 99 $ I%, " = log2 log2 =.08 bit & # Efficiently Splitting the Cases Assume world is divided into two classes: P (positive) and N (negative). Assumptions for test are: 1. Any correct decision tree will classify objects in the same proportion as their representation in the example set. An object will belong to P with probability p/p+n and N with probability n/n+p. 2. A decision tree returns P or N with expected information needed given by: I(p,n) = p p + n log p 2 p + n n p + n log n 2 p + n Information content of knowing whether xεp or xεn. Information Gain If attribute A with values {A1,A2, Av} is the root, it will partition the examples into {C1,C2, Cv} where Ci contains examples that have value Ai of A. The expected information required for tree with A as root is: v p Remainder(A) = i + n i I(p p + n i,n i ) i=1 The information gained by branching on A is: gain(a) = I (p,n) Remainder(A) Example: Training Set # Outlook Temperature Humidity Windy Class 1 sunny hot high false N 2 sunny hot high true N 3 overcast hot high false P 4 rain mild high false P 5 rain cool normal false P 6 rain cool normal true N 7 overcast cool normal true P 8 sunny mild high false N 9 sunny cool normal false P 10 rain mild normal false P 11 sunny mild normal true P 12 overcast mild high true P 13 overcast hot normal false P 14 rain mild high true N 7

8 Example Trees Example: Information outlook humidity sunny {N,N,N,P,P} overcast {P,P,P,P} rain {P,P,P,N,N} High {N,N,N,N,P,P,P} Normal {P,P,P,P,P,P,N} Temperature windy Hot {N,N,P,P} Mild {P,P,P,P,N,N} Cool {P,P,P,N} False {P,P,P,P,P,P,N,N} True {P,P,P,N,N,N} I(p,n) = 9 14 log log = 0.94bits Attribute outlook is {sunny, overcast,rainy} sunny p 1 = 2 n 1 = 3 I(p 1,n 1 ) = overcast p 2 = 4 n 2 = 0 I( p 2,n 2 ) = 0.0 rainy p 3 = 3 n 3 = 2 I (p 3,n 3 ) = Example (cont.) Remainder(outlook) = 5 14 I( p 1,n 1 ) I( p 2,n 2 ) I( p 3,n 3 ) = 0.69bits gain(outlook) = Remainder(outlook) = 0.246bits Example (cont.) Continued analysis yields: gain(outlook) = gain(temperature) = gain(humidity) = gain(windy) =

9 Purity Measure Necessary properties: When node is pure, measure should be 0 When impurity is maximal (i.e., all classes equally likely), measure should be maximal Measure should obey multistage property. Decisions can be made in several stages. ID3: Using Information Gain Given as inputs a set of attributes A, an overall set S and a target concept T: 1. If every element of S is in T, then return yes ; If no element is in T, return no. 2. Otherwise, let a be the most discriminating element (highest gain) of A. Return a tree whose left branch includes all attributes remaining except a, all members of S also in a, and the target; the right branch includes everything not satisfying a. 3. Recurse building the tree incrementally. C4.5: Extension to ID3 Accommodates Missing attribute values In building a tree by considering only records with the values In using a tree by estimating probability distribution for different values from observed cases Continuous values Generates partitions according to observed values and considers values up to each partition value as attributes Supervised IV: Statistical Modeling Assumptions about attributes Equally important Statistically independent Knowing the value of one attribute says nothing about the value of another if the class is known In practice, the assumptions do not hold, but can still derive fairly accurate models! 9

10 Bayes s rule Weather example revisited Probability of event H given evidence E : Pr[ E H ]Pr[ H ] Pr[ H E] = Pr[ E] A priori probability of H : Pr[H ] Probability of event before evidence is seen A posteriori probability of H : Pr[ H E] Probability of event after evidence is seen Outlook Temp Humidity Windy Class sunny hot high false N sunny hot high true N overcast hot high false P rain mild high false P rain cool normal false P rain cool normal true N overcast cool normal true P sunny mild high false N sunny cool normal false P rain mild normal false P sunny mild normal true P overcast mild high true P overcast hot normal false P rain mild high true N A Priori probability Pr[P]=9/14=0.64 Pr[N]=5/14=0.36 A Posteriori probability for Outlook Pr[P sunny]=2/5=0.40 Pr[P overcast]=4/4=1.0 Pr[P rain]=3/5=0.60 Naïve Bayes Classifier Weather example revisited Based on Bayes Rule: p(c F 1,F 2,...,F n ) = p(c)p(f 1,F 2,...,F n C) p(f 1,F 2,...,F n ) Simplify by assuming independence, i.e., p(f i C,F j ) = p(f i C) Produces: p(c F 1,F 2,...,F n ) = p(c) p(f i C) n i=1 Outlook Temp Humidity Windy Class sunny hot high false N sunny hot high true N overcast hot high false P rain mild high false P rain cool normal false P rain cool normal true N overcast cool normal true P sunny mild high false N sunny cool normal false P rain mild normal false P sunny mild normal true P overcast mild high true P overcast hot normal false P rain mild high true N Sunny Cool High True?? Pr[P E]=Pr[Outlook=sunny P] x Pr[Temp=cool P] x Pr[Humidity=high P] x Pr[Windy=true P] x Pr[P]/Pr[E] 2 = Pr[E] = Pr[E] 10

11 Zero Frequency Problem An attribute value does not occur with every class value, e.g., Outlook=overcast and class N. Pr[Outlook=overcast N]=0 Pr[N Outlook=overcast]=0 Remedy: add 1 to the count for every attribute value-class combination No 0 probabilities Stabilizes estimates Classifier from NBC Model Typically, select classification based on MAP decision rule: Maximum A Posteriori rule. classify( f 1, f 2,...f n ) = argmax c P(C = c) p(f i = f i C = c) n i=1 Unsupervised I: Cluster or Outlier Analysis group together similar data, usually when class labels are unknown detect anomalous data example: connection features that do not match normal access or classify types (assume members share some characteristics) example: connection features of a smurf network attack attack normal unknown Clustering in General A cluster is a cone in n-dimensional space: members are found within ε of an average vector. Different techniques for determining the cluster vector and ε. Hard clustering techniques assign a class label to each cluster; members of clusters are mutually exclusive. Fuzzy clustering techniques assign a fractional degree of membership to each label for each x. 11

12 Clustering Decisions Proximity Measures Pattern Representation feature selection (which to use, what type of values) number of categories/clusters Pattern proximity distance measure on pairs of patterns Grouping characteristics of clusters (e.g., fuzzy, hierarchical) Clustering algorithms embody different assumptions about these decisions and the form of clusters. Generally, use Euclidean distance or mean squared distance. May need to normalize or use metrics less sensitive to magnitude differences (e.g., Manhattan distance aka L1 norm or Mahalanobis distance) May have domain specific distance measure, e.g., cosine between vectors for information retrieval. Categories of Clustering Techniques Partitional: Incrementally construct partitions based on some criterion (e.g., distance from data to cluster vector). Hierarchical: Create a hierarchical decomposition of the set of data (or objects) using some criterion. Model-based: A model is hypothesized for each of the clusters Find the best fit of that model to each other. E.g., Bayesian classification (AutoClass), Cobweb. Partitional Algorithms Results in set of unrelated clusters. Issues: how many clusters is enough? usually set by user how to search space of possible partitions? incremental, batch, exhaustive what is appropriate clustering criterion? global optimal heuristic: distance to centroid (k-means), distance to some median point (k-mediods) 12

13 K Means Number of clusters is set by user to be k. Non-deterministic Clustering criterion is squared error: e( S, L) j = K n j= 1 i= 1 c where S is data set, L is a clustering, K is number of clusters, x is ith instance in jth cluster and c is centroid of jth cluster. x j i j 2 k-means Clustering Algorithm 1. Randomly select k samples as cluster centroids. 2. Assign each pattern to the closest cluster centroid. 3. Recompute centroids. 4. If convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to 2. Example:K-Means Solutions k-means Sensitivity to Initialization A C B D E K=3, red started w/a, D, F; yellow w/a, B, D F G 13

14 Evaluation Metrics How to measure cluster quality? Members of each cluster should share some relevant properties. high intra-class similarity low inter-class similarity Another metric for cluster quality: Entropy E = p log( p ) E j CS = m i j= 1 j ij n * E n j where pij is probability that a member of cluster j belongs to class i, nj is size of cluster j, m is number of clusters, n is number of instances and CS is a clustering solution. ij Associations frequently co-occurring features example market basket problem: Amazon: customers who bought Tom Clancy novels also bought Robert Ludlum apocryphal: people who buy diapers are also likely to buy beer (but not the converse) A1 Am B1 Bn where A & B are attribute value pairs i j What makes good association rules? Terminology Interestingness metrics support confidence attributes of interest rules with a specific structure Comparing rule sets completeness: generate all interesting rules efficiency: generate only rules that are interesting Each DB transaction is a set of items (called an itemset). A k-itemset is an itemset containing sets of cardinality k. An association rule is A=>B, such that A and B are sets of items with empty set as their intersection and including items that co-occur in transactions. 14

15 Terminology (cont.) Association Rule Mining A rule has support s, where s is % of transactions that contain both A and B: P(A B). (Purple/Yellow) A rule has confidence c, where c is % of transactions containing A that also contain B: P(B A). (Purple/Red) A Rules with support and confidence exceeding a threshold are called strong. A frequent itemset is the set of itemsets whose minimum support exceeds a threshold. B Process 1) Find all frequent itemsets based on minimum support 2) Identify strong association rules from the frequent itemsets Rule Representations types of values: Boolean, quantitative (binned) # of dimensions/predicates levels of abstraction/type hierarchy Apriori Algorithm for Finding Single- Dimensional Boolean Itemsets Apriori property: all nonempty subsets of a frequent itemset must also be frequent. Generate k-itemsets by joining large k-1-itemsets and deleting any that is not large. Assume items in lexicographic order Notation: k-itemset itemset of size k Lk Ck k-itemsets with minsup candidate k-itemsets Apriori Algorithm L 1 = {large 1 itemsets} count item frequency for(k = 2;L k 1 {};k + +)do begin Ck = apriori gen( Lk 1); new candidates transactions t D do begin C = subset( C, t); t k candidates in transaction candidates c Ct do c. count + +; determine support end Lk = { c Ck c. count minsup} create new set end Answer = L ; k k 15

16 Apriori-gen: join Apriori gen( Lk 1) insert into Ck select p. item1, p. item2,..., p. item from Lk 1 p; L q where p. item k 2, p. item k 1 1 = q. item1,..., p. itemk 2 = q. itemk 2, p. itemk 1 < q. itemk 1; L2={{a b},{a c},{a g},{b c},{b f},{c e},{c f},{e f}} p={a b}, q={a c}, add {a b c} to C3 p={a b}, q={a g}, add {a b g} to C3 p={a c}, q={a g}, add {a c g} to C3 p={b c}, q={b f}, add {b c f} to C3 p={c e}, q={c f}, add {c e f} to C3 k 1 join Apriori-gen: prune Apriori gen( Lk 1) join itemsets c C do k prune ( k 1) subsets s of c do if (s L ) then delete c from C ; k-1 k L2={{a b},{a c},{a g},{b c},{b f},{c e},{c f},{e f}} C3={{a b c},{a b g},{a c g},{b c f},{c e f}} {a b c} => {{a b},{a c},{b c}} TRUE {a b g} => {{a b},{a g},{b g}} FALSE Apriori Algorithm L1 = {large 1 itemsets} for( k = 2; L {}; k + + }do begin k 1 Ck = apriori gen( Lk 1); transactions t D do begin C = subset( C, t); t k candidates c Ct do c. count + +; end Lk = { c Ck c. count minsup} end Answer = L ; k k count item frequency new candidates candidates in transactions determine support create new set Apriori: Subset Find all transactions associated with candidate itemsets. Candidate itemsets are stored in hash-tree Leaves contain a list of itemsets. Interior nodes contain a hash table depth corresponds to position of item of itemset Find all candidates contained in transaction t: if at leaf, check itemsets against t if interior for ith item, hash on items after i in t and recurse 16

17 Apriori Example D=((c e f),(b c e f),(b d),(b f g),(a c e),(a b),(a b c),(a c g),(c e f),(a d g),(c f),(a b c e f),(a f g),(c e f g),(a b c g),(a),(a g),(c e)) minsup=4 L1={{a 10},{b 6},{c 11},{e 7},{f 8},{g 7}} no d C2={{a b},{a c},{a e},{a f},{a g},{b c},{b e},{b f},{b g},{c e},{c f}, {c g},{e f},{e g},{f g}} after join&prune L2={{a b 4},{a c 5},{a g 5},{b c 4},{c e 7},{c f 6},{e f 5}} after support C3={{a b c},{c e f}} after join&prune L3={{c e f 5}} after support Generating Association Rules from Supported Itemsets Rules take form A=>B where A & B are subsets of items that co-occur. Rules have support above minsup threshold and confidence above minconf thresold. support_count(a B) confidence(a B) = P(B A) = support_count(a) support_count(a B) = transactions containing itemsets A B support_count(a) = transactions containing itemsets A Apriori Problems Generating candidate itemsets: n frequent 1-itemsets generates (n(n-1))/2 candidate 2-itemsets generate itemsets for each length Multipass over data for frequencies Generating Association Rules from Supported Itemsets (cont.) Generate all rules such that: for each frequent itemset i of cardinality > 1, generate all non-empty subsets. for every non-empty subset s, if confidence(s=>(i-s))>minconf, create rule s=>(i-s) Can use cached itemset frequencies for confidence calculation. 17

18 Association Rule Example (cont.) Frequent itemsets: {{c e f 5},{a b 4},{a c 5},{a g 5}, {b c 4},{c e 7},{c f 6},{e f 5}} 1. Non-empty subsets of {c e f}:{c e},{c f},{e f},{c}, {e},{f} 2. confidence({c e}=>{f})=5/7= confidence({c f}=>{e})=5/6= confidence({e f}=>{c})=5/5= confidence({c}=>{e f})=5/11= confidence({e}=>{c f})=5/7= confidence({f}=>{c e})=5/8=.62 Interestingness Revisited P( A B) Correlation/Lift P( A) P( B) P(A^B)=P(B)*P(A), if A and B are independent events A and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated X Y Z Itemset Support Interest X,Y 25% 2 X,Z 37.50% 0.9 Y,Z 12.50% 0.57 revised from Han&Kamber notes at Other Methods for Association Mining Improving Apriori efficiency: Partitioning Sampling Multi-level or generalized association Multi-Dimensional association Quantitative association rule mining Constraint-based or query-based association Correlation analysis 18

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"