Unsupervised Learning. k-means Algorithm


Unsupervised Learning

Supervised Learning: Learn to predict y from x from examples of (x, y). Performance is measured by error rate.

Unsupervised Learning: Learn a representation from examples of x. Learn how the examples are similar to/different from each other. There is no standard way to measure performance.

Clustering. Group the data, maximizing similarity within groups.
Component Analysis. Extract the most descriptive features.
Association Rules. Find common combinations of attributes.

k-means Algorithm

Goal: Find k centers c_1, ..., c_k to minimize

    Σ_i min_j d(x_i, c_j)

where d is a distance measure (typically Euclidean). Choose the k beyond which adding another center no longer gives a big decrease in this objective.

k-means
  Repeatedly:
    Initialize centers c_1, ..., c_k to k of the examples.
    Repeat until convergence:
      For each example x_i, find its closest center.
      Set each center to the mean of the examples assigned to it.
  Return the best centers found.
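To make the loop concrete, here is a minimal Python/NumPy sketch of the procedure above; the function name kmeans, the restart and iteration counts, and the random seeding are illustrative assumptions, not part of the original slides.

import numpy as np

def kmeans(X, k, n_restarts=5, max_iters=100, seed=0):
    """Find k centers approximately minimizing sum_i min_j d(x_i, c_j), d Euclidean."""
    rng = np.random.default_rng(seed)
    best_centers, best_cost = None, np.inf
    for _ in range(n_restarts):                                   # "Repeatedly"
        centers = X[rng.choice(len(X), size=k, replace=False)]    # initialize to k examples
        for _ in range(max_iters):                                # "Repeat until convergence"
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            closest = dists.argmin(axis=1)                        # closest center of each x_i
            new_centers = np.array([X[closest == j].mean(axis=0) if np.any(closest == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):                 # converged
                break
            centers = new_centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        cost = dists.min(axis=1).sum()                            # sum_i min_j d(x_i, c_j)
        if cost < best_cost:
            best_centers, best_cost = centers, cost
    return best_centers

Running this for several values of k and looking for the point where the cost stops decreasing sharply is one way to apply the "choose k" rule above.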

[Figure: 3 clusters in the 2-attribute Iris dataset (Petal Width vs. Petal Length; classes setosa, versicolor, virginica).]
[Figure: 4 clusters in the 2-attribute, 2-class Glass dataset (Aluminum vs. Refractive Index; classes build wind float, build wind non-float).]

Expectation-Maximization

Goal: Find a (local) maximum-likelihood estimate of unknown parameters Z governed by a probability distribution.

Restated: Over a set of examples, find a hypothesis h that maximizes

    Σ_i Pr(x_i, z_i | h)

where each x_i is known and each z_i is unknown. We require that z_i can be predicted from h and that h can be predicted from the z_i.

EM Algorithm

EM
  Repeatedly (in case of local maxima):
    Initialize hypothesis h.
    Repeat until convergence:
      // Expectation step
      For each example x_i, z_i ← prediction from h.
      // Maximization step
      h ← hypothesis assuming the z_i.
  Return the best hypothesis found.

Learning Mixtures of k Normal Distributions

Suppose we assume the examples fall into k clusters, each corresponding to a normal distribution. Each distribution is defined by a mean vector u_j (the means of the attributes) and a covariance matrix Σ_j.

For each example x_i, we can predict

    z_ij = Pr(x_i | u_j, Σ_j),

the probability that x_i belongs to cluster j. Given the z_ij, we can predict the means by (and similarly the covariances):

    u_j = Σ_i z_ij x_i / Σ_i z_ij

[Figure: 3 clusters in the simplified Iris dataset (Petal Width vs. Petal Length; classes setosa, versicolor, virginica).]
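A compact sketch of EM for a mixture of k normal distributions in the same Python/NumPy style; the use of scipy.stats.multivariate_normal, the mixing weights, and the small regularization added to each covariance are my own additions to keep the sketch numerically stable, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, n_iters=100, seed=0):
    """EM for a mixture of k normal distributions (means u_j, covariances Sigma_j)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]         # initialize hypothesis h
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)                           # cluster priors (an addition)
    for _ in range(n_iters):
        # Expectation step: z_ij, the probability that x_i belongs to cluster j
        z = np.column_stack([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                             for j in range(k)])
        z /= z.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate h from the z_ij
        totals = z.sum(axis=0)
        means = (z.T @ X) / totals[:, None]                 # u_j = sum_i z_ij x_i / sum_i z_ij
        for j in range(k):
            diff = X - means[j]
            covs[j] = (z[:, j, None] * diff).T @ diff / totals[j] + 1e-6 * np.eye(d)
        weights = totals / n
    return means, covs, weights, z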

[Figure: 4 clusters in the simplified Glass dataset (Aluminum vs. Refractive Index).]

Hierarchical Clustering

In hierarchical clustering, the examples are leaves of a tree and each internal node is a cluster. This algorithm is agglomerative (bottom-up) hierarchical clustering.

Hierarchical-Clustering
  For each example x_i, C_i ← {x_i}.
  Repeat until one cluster remains:
    Find the two nearest clusters C_i and C_j.
    Merge C_i and C_j into one cluster.
  Return the cluster hierarchy.
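A brute-force Python/NumPy sketch of this agglomerative procedure; the cluster_distance helper supports the three cluster-distance measures defined on the following slide, and its name and the simple O(n^3) merge loop are my own choices.

import numpy as np

def cluster_distance(A, B, measure="nearest"):
    """Distance between two clusters, given as (m, d) and (m', d) arrays of examples."""
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if measure == "nearest":                     # minimum distance between members
        return pair.min()
    if measure == "farthest":                    # maximum distance between members
        return pair.max()
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # "mean": distance of the means

def agglomerative(X, measure="nearest"):
    clusters = [[i] for i in range(len(X))]      # each example starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # find the two nearest clusters C_i and C_j
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_distance(X[clusters[ab[0]]], X[clusters[ab[1]]], measure))
        merges.append((clusters[i], clusters[j]))            # record the hierarchy
        clusters[i] = clusters[i] + clusters[j]              # merge C_i and C_j
        del clusters[j]
    return merges

Each entry of merges records the two clusters joined at that step, which is exactly the information a dendrogram (below) visualizes.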

Distances between Clusters

There are several distance measures between clusters.

Nearest Neighbor. The minimum distance between an example in C_i and an example in C_j. This encourages elongated clusters.
Farthest Neighbor. The maximum distance between an example in C_i and an example in C_j.
Mean. The distance between the means of C_i and C_j.

In the following, intermediate results are shown for measuring distance by nearest neighbor (first two graphs) and by mean (last two graphs).

[Figure: Simplified Iris dataset, nearest-neighbor distance (Petal Width vs. Petal Length; classes setosa, versicolor, virginica).]

[Figure: Simplified Glass dataset, nearest-neighbor distance (Aluminum vs. Refractive Index).]
[Figure: Simplified Iris dataset, mean distance (classes setosa, versicolor, virginica).]

[Figure: Simplified Glass dataset, mean distance (Aluminum vs. Refractive Index).]

Dendrograms

A dendrogram illustrates the hierarchical clustering. Here is a dendrogram for the numbers 1, 3, 5, 7, 11, 13, 17, 19 using the mean distance.

[Figure: dendrogram over the numbers 1, 3, 5, 7, 11, 13, 17, 19, with merge heights shown on a Distance axis.]
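For example, SciPy can draw such a dendrogram directly; using method="centroid" to approximate the mean (distance-between-means) measure and the plotting details below are my assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

nums = [1, 3, 5, 7, 11, 13, 17, 19]                  # the numbers from the slide
points = np.array(nums, dtype=float).reshape(-1, 1)  # treat them as 1-dimensional points
Z = linkage(points, method="centroid")               # centroid = distance between cluster means
dendrogram(Z, labels=nums)
plt.ylabel("Distance")
plt.show()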

Association Rules

An association rule indicates that a set of items implies another item, with levels of confidence and support. In market basket analysis, the goal is to determine which items are bought together with other items, e.g., cereal ⇒ milk.

For our purposes, an item specifies attribute=value (we discuss < and > later). Each example x is treated as a transaction, which is a set of items. An association rule is an implication A ⇒ B, where A and B are sets of items (typically, though, |B| = 1).

Confidence and Support

The support of a set of items A is the number of examples x where A ⊆ x. The support of an association rule A ⇒ B is the support of A ∪ B. The confidence of an association rule A ⇒ B is support(A ∪ B) / support(A).

support(A) corresponds to Pr(A). confidence(A ⇒ B) corresponds to Pr(B | A).

No.  Outlook   Temp  Humidity  Windy  Class
1    sunny     hot   high      false  neg
2    sunny     hot   high      true   neg
3    overcast  hot   high      false  pos
4    rain      mild  high      false  pos
5    rain      cool  normal    false  pos
6    rain      cool  normal    true   neg
7    overcast  cool  normal    true   pos
8    sunny     mild  high      false  neg
9    sunny     cool  normal    false  pos
10   rain      mild  normal    false  pos
11   sunny     mild  normal    true   pos
12   overcast  mild  high      true   pos
13   overcast  hot   normal    false  pos
14   rain      mild  high      true   neg

Examples

Here are some association rules from the weather dataset.

Association Rule          Conf  Sup
overcast ⇒ pos            4/4   4
cool ⇒ normal             4/4   4
(sunny ∧ neg) ⇒ high      3/3   3
normal ⇒ pos              6/7   6
neg ⇒ high                4/5   4
cool ⇒ (normal ∧ pos)     3/4   3
mild ⇒ high               4/6   4
(normal ∧ pos) ⇒ false    4/6   4
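To make the definitions concrete, a couple of these rules can be checked with a few lines of Python; representing each example as the set of its attribute values (plus the class) is my own encoding, not part of the slides.

# each example from the table as a set of its attribute values (plus the class)
rows = [
    {"sunny", "hot", "high", "false", "neg"},    {"sunny", "hot", "high", "true", "neg"},
    {"overcast", "hot", "high", "false", "pos"}, {"rain", "mild", "high", "false", "pos"},
    {"rain", "cool", "normal", "false", "pos"},  {"rain", "cool", "normal", "true", "neg"},
    {"overcast", "cool", "normal", "true", "pos"}, {"sunny", "mild", "high", "false", "neg"},
    {"sunny", "cool", "normal", "false", "pos"}, {"rain", "mild", "normal", "false", "pos"},
    {"sunny", "mild", "normal", "true", "pos"},  {"overcast", "mild", "high", "true", "pos"},
    {"overcast", "hot", "normal", "false", "pos"}, {"rain", "mild", "high", "true", "neg"},
]

def support(itemset):
    return sum(1 for x in rows if itemset <= x)      # number of examples x with A ⊆ x

def confidence(A, B):
    return support(A | B) / support(A)               # support(A ∪ B) / support(A)

print(support({"normal", "pos"}), confidence({"normal"}, {"pos"}))      # 6 and 6/7
print(support({"overcast", "pos"}), confidence({"overcast"}, {"pos"}))  # 4 and 4/4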

Frequent Itemsets

The Apriori algorithm for finding association rules is based on finding frequent k-itemsets: sets of k items with at least a minimum support, which is a parameter to Apriori. If we require a minimum support of 5, then:

Frequent 1-itemsets: sunny, true, rainy, false, mild, pos, high, neg, normal
Frequent 2-itemsets: {normal, pos}, {false, pos}

Apriori Algorithm

Apriori
  L_1 ← frequent 1-itemsets
  For k from 2 on up
    C_k ← Apriori-gen(L_{k-1})    // candidates
    For each example x_i
      C_{k,i} ← {c : c ∈ C_k ∧ c ⊆ x_i}
      For each candidate c ∈ C_{k,i}
        Increment c.count
    L_k ← {c : c ∈ C_k ∧ c.count ≥ minimum support}
    Exit loop if L_k = ∅
  Return L_1, ..., L_k
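In Python, the loop above might be sketched as follows; for brevity this version replaces Apriori-gen with a simpler join of frequent (k-1)-itemsets and omits the prune step described on the next slide, so the helper names and structure are assumptions.

from itertools import combinations

def apriori(examples, min_support):
    """examples: list of sets of items. Returns a dict mapping frequent itemsets (tuples) to counts."""
    def count(cands):
        counts = {c: 0 for c in cands}
        for x in examples:                          # for each example x_i
            for c in cands:
                if set(c) <= x:                     # c ⊆ x_i
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {i for x in examples for i in x}
    L = count([(i,) for i in sorted(items)])        # frequent 1-itemsets
    frequent = dict(L)
    k = 2
    while L:
        # simplified candidate generation: join pairs of frequent (k-1)-itemsets
        prev = sorted(L)
        candidates = sorted({tuple(sorted(set(p) | set(q)))
                             for p, q in combinations(prev, 2)
                             if len(set(p) | set(q)) == k})
        L = count(candidates)
        frequent.update(L)
        k += 1
    return frequent

For instance, apriori(rows, 5) on the weather examples encoded earlier recovers the frequent 1- and 2-itemsets listed above.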

Apriori Candidate Generation

Apriori-gen(L_{k-1})
  // Assuming L_{k-1} and each c ∈ L_{k-1} is sorted
  // Join step
  For each itemset pair p ∈ L_{k-1} and q ∈ L_{k-1}
    If p = q except for their last items
    Then insert p ∪ q into C_k
  // Prune step
  For each candidate c ∈ C_k
    If some (k-1)-subset of c is not in L_{k-1}
    Then remove c from C_k
  Return C_k

Sorted sets allow efficient candidate generation.

Principal Components Analysis

PCA is one of several techniques to analyze a dataset and/or extract its most relevant attributes. PCA identifies and ranks n new attributes that are linear combinations of the n old attributes. The ranking is by variance reduction, i.e., by the variance in the new attributes.

The new attributes are orthogonal to each other, so the full result is just a linear transformation; however, we can select those new attributes which account for the most variance in the data (dimensionality reduction).

[Figure: principal component of the simplified Iris dataset, shown on a Petal Width vs. Petal Length plot.]
[Figure: principal component of the simplified Glass dataset, shown on an Aluminum vs. Refractive Index plot.]

Finding Principal Components

Principal components are eigenvectors of the correlation matrix. If the examples x have n attributes with mean u, then the n x n correlation matrix A is defined by:

    A_ij = Σ_x (x_i − u_i)(x_j − u_j) / sqrt( Σ_x (x_i − u_i)² · Σ_x (x_j − u_j)² )

An n x 1 eigenvector e with eigenvalue λ satisfies:

    A e = λ e

The eigenvectors are ranked by decreasing eigenvalues. Algorithms are in the online Numerical Recipes book.

Examples

In each dataset, the attributes have been standardized.

Using the 4-attribute Iris dataset, the top two principal components are:

    0.521 slen − 0.269 swid + 0.580 plen + 0.565 pwid
    −0.377 slen − 0.923 swid − 0.024 plen − 0.067 pwid

Using the 9-attribute Glass dataset (2 classes), the top two are:

    .55 RI .38 Na .359 Mg .68 Al .344 Si .38 K +.49 Ca .8 Ba +.68 Fe
    .4 RI .56 Na .34 Mg +.47 Al .4 Si +.37 K +.4 Ca +.379 Ba +.38 Fe
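A short NumPy sketch of this computation; np.corrcoef builds the correlation matrix A and np.linalg.eigh solves A e = λ e, while the function name and the projection example in the trailing comment are assumptions for illustration.

import numpy as np

def principal_components(X):
    """X: (n_examples, n_attributes) array. Returns eigenvalues and eigenvectors
    (as columns) of the correlation matrix A, ranked by decreasing eigenvalue."""
    A = np.corrcoef(X, rowvar=False)          # n x n correlation matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(A)      # solves A e = lambda e (A is symmetric)
    order = np.argsort(eigvals)[::-1]         # rank by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]

# For example, to project standardized data onto the top two components:
#   Z = (X - X.mean(axis=0)) / X.std(axis=0)
#   scores = Z @ principal_components(X)[1][:, :2]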

[Figure: the Iris data projected onto its top two principal components (Iris PCA Component 1 vs. Iris PCA Component 2).]
[Figure: the Glass data (2 classes) projected onto its top two principal components (Glass PCA Component 1 vs. Glass PCA Component 2); some green X's fall outside the plotted range.]

Comments on Unsupervised Learning

Unsupervised learning can be useful for finding interesting patterns; however, it can also find irrelevant patterns, e.g., City = San Antonio ⇒ State = Texas.

It is difficult to determine the number of clusters or components. For clustering, choose k such that a higher k does not improve the objective much. For PCA, eliminate the components with low eigenvalues.

This is a small sample of unsupervised techniques. Besides the Duda, Hart, Stork book, consider:

T. Hastie, R. Tibshirani and J. Friedman (2001). The Elements of Statistical Learning. Springer-Verlag.
R. E. Neapolitan (2004). Learning Bayesian Networks. Prentice-Hall.