1 Basic Concept and Similarity Measures


THE UNIVERSITY OF CHICAGO
Graduate School of Business
Business 41912, Spring Quarter 2016, Mr. Ruey S. Tsay

Lecture 10: Cluster Analysis and Multidimensional Scaling

1 Basic Concept and Similarity Measures

Cluster analysis attempts to group similar observations together based on some distance measure. It differs from the classification discussed before, because in classification the number of groups is assumed to be known. The inputs of cluster analysis are similarity measures or data from which similarities can be computed.

Similarity measures: There are many measures available; they all involve some degree of subjectivity. Important considerations include the nature of the variables, the scales of measurement, and subject-matter knowledge. Below are some distances and similarity coefficients for pairs of items.

1. Euclidean distance between two p-dimensional observations x and y:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} = \sqrt{(x - y)'(x - y)}.$$

2. Statistical distance:

$$d(x, y) = \sqrt{(x - y)' S^{-1} (x - y)},$$

where S is the sample covariance matrix.

3. Minkowski metric:

$$d(x, y) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}.$$

4. Canberra metric:

$$d(x, y) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}.$$

5. Czekanowski coefficient:

$$d(x, y) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}.$$

Remark. A distance measure needs to satisfy the following basic properties:

$d(P, Q) = d(Q, P)$

$d(P, Q) > 0$ if $P \neq Q$
$d(P, Q) = 0$ if $P = Q$
$d(P, Q) \le d(P, R) + d(R, Q)$, the triangle inequality.

When items cannot be represented by meaningful distance measurements, pairs of items are often compared on the basis of the presence or absence of certain characteristics, which leads to the use of binary variables. Similar items have more characteristics in common than do dissimilar items. After tabulating the binary variables, one can define similarity coefficients between items and use them as a distance measure. See Table 12.1 of the textbook for a list of similarity coefficients.

2 Hierarchical Clustering Methods

Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions. Agglomerative hierarchical methods start with the individual objects and merge similar objects into groups iteratively. Divisive hierarchical methods work in the opposite direction: they start with a single group and iteratively divide the objects into subgroups so that objects in one subgroup are far from those of the other subgroup.

Following the textbook, we discuss agglomerative hierarchical procedures. In particular, we focus on three linkage methods. The results of a cluster analysis can be shown by a two-dimensional diagram known as a dendrogram.

Suppose that there are N objects. Associated with them is an $N \times N$ symmetric matrix of distances (or similarities) $D = [d_{ij}]$. For any two groups U and V, denote the distance between them by $d(U, V) = d_{UV}$.

Single linkage: (minimum distance or nearest neighbor).

1. Find the minimum distance between the objects to form a cluster of two objects, say (UV).
2. The distance between (UV) and any other object W is defined as $d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$.
3. Find a new group from the resulting new distance matrix.
4. Iterate the process.

See the demonstration.

Complete linkage: (maximum distance)

Following a procedure similar to that of single linkage, complete linkage groups objects together by the minimum distance, but updates the distance using the maximum,

$$d_{(UV)W} = \max\{d_{UW}, d_{VW}\}.$$

Average linkage: (average distance)

Updates the distance using the average. Consider two subgroups (UV) and W. Then,

$$d_{(UV)W} = \frac{\sum_i \sum_k d_{ik}}{N_{(UV)} N_W},$$

where i and k denote members of (UV) and W, respectively, and $N_{(UV)}$ and $N_W$ are the numbers of objects in the subgroups (UV) and W.

Demonstration: In R, the command for hierarchical clustering is hclust.

Demonstration of hierarchical clustering analysis

*** Data: Table 12-4, 11 languages
*** Commands used "hclust", "dist", and "plot".
*** Created a data matrix based on Example 12.2 on page 680.

da=read.table("t12-3a.dat")
da
(output omitted: a matrix with columns V1 to V11)
colnames(da) <- c("e","n","da","du","g","fr","sp","i","p","h","fi")
dd=10-da   # compute a distance measure
d1=as.dist(dd)
m1=hclust(d1,method="single")

plot(m1)
# The program groups "Fr", "Sp" and "I" in one step.
# See Figure 12.4 on page 685.
m2=hclust(d1)   # default is "complete" linkage.
plot(m2)   # See Figure 12.7 on page 687.
m3=hclust(d1,method="average")
plot(m3)   # See Figure 12.9 on page 690.

*** Public Utility Data on Table 12.4

da=read.table("t12-4.dat")
dim(da)
[1] 22 9
da[1,]
(output omitted: the first row, with columns V1 to V9; the company is Arizona)
x=da[,1:8]
aa=cor(x)
print(aa,digits=3)
(output omitted: the 8 x 8 correlation matrix of V1 to V8)
d1=as.dist(aa)
print(d1,digits=3)
(output omitted: the lower triangle of the correlation matrix)

d2=-d1   # High correlations mean more similar. Distance should be shorter.
# Thus, the sign is changed.
m4=hclust(d2)
plot(m4)   # See Figure 12.8 on page 689.
x1=var(x)
print(x1,digits=3)
(output omitted: the 8 x 8 sample covariance matrix)
se=sqrt(diag(x1))
print(se,digits=3)
(output omitted: the standard deviations of V1 to V8)
a1=diag(se)
a1inv=solve(a1)
y=as.matrix(x)%*%a1inv   # scale each variable by its standard deviation
print(var(y),digits=3)
(output omitted: the 8 x 8 covariance matrix of the standardized data)
dd=dist(y,method="euclidean")
print(dd,digits=2)

(Output edited. This is Table 12.6 of the textbook, p. 691.)
m5=hclust(dd)
plot(m5)   # See the corresponding figure in the textbook.

Nonhierarchical Clustering Methods

K-means method: MacQueen (1967) uses the term k-means to describe an algorithm that assigns each item to the cluster having the nearest centroid. In its simplest version, the process consists of three steps.

1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning each item to the cluster whose centroid is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat Step 2 until no more reassignments take place.

Demonstration: In R, the command is kmeans.

*** Non-hierarchical clustering method ***
*** Data: The standardized public utility companies.

help(kmeans)
m6=kmeans(y,4)
names(m6)
[1] "cluster" "centers" "withinss" "size"
m6$cluster
(output omitted)
print(m6$centers,digits=3)
(output omitted: a 4 x 8 matrix of cluster centers)
m6$size

(output omitted)
m6$withinss
(output omitted)
d1=dist(m6$centers,method="euclidean")
print(d1,digits=3)
(output omitted)

*** 5 groups ***

m7=kmeans(y,5)
m7$cluster
(output omitted)
m7$size
(output omitted)
m7$withinss
(output omitted)
print(m7$centers,digits=3)
(output omitted: a 5 x 8 matrix of cluster centers)
d1=dist(m7$centers,method="euclidean")
print(d1,digits=3)
(output omitted)

Model-based approach to clustering with applications

We shall discuss the following two papers. You may download the two papers from the UC library.

1. Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis

and density estimation. Journal of the American Statistical Association, 97.

2. Frühwirth-Schnatter, S. and Kaufmann, S. (2008). Model-based clustering of multiple time series. Journal of Business & Economic Statistics, 26.

Multidimensional Scaling

Problem: Suppose that there are N items with $N(N-1)/2$ pairs of similarities (or distances). Multidimensional scaling seeks a representation of the items in a few dimensions such that the inter-item proximities nearly match the original similarities (or distances).

Stress: The numerical measure of closeness between the original similarities and the fitted values of similarities.

In practice, if only the rank orders of the $N(N-1)/2$ original similarities are used, the process is called nonmetric multidimensional scaling. If the actual magnitudes of the original similarities are used, the process is called metric multidimensional scaling. The latter is also known as principal coordinate analysis.

Denote the similarity between items i and j by $s_{ij}$. Assume there are no ties in similarities. Arrange the similarities in a strictly ascending order as

$$s_{i_1 k_1} < s_{i_2 k_2} < \cdots < s_{i_M k_M}, \qquad (1)$$

where $M = N(N-1)/2$. (If distances are used, arrange in a strictly decreasing order.) Multidimensional scaling attempts to find a q-dimensional configuration of the N items such that the distances $d^{(q)}_{ik}$ between the pairs of items match the ordering in (1). A perfect match occurs if

$$d^{(q)}_{i_1 k_1} > d^{(q)}_{i_2 k_2} > \cdots > d^{(q)}_{i_M k_M}. \qquad (2)$$

Kruskal (1964) proposed stress as a measure of closeness. It is defined as

$$\mathrm{Stress}(q) = \left\{ \frac{\sum_{i<k} \left( d^{(q)}_{ik} - \hat{d}^{(q)}_{ik} \right)^2}{\sum_{i<k} \left[ d^{(q)}_{ik} \right]^2} \right\}^{1/2},$$

where $d^{(q)}_{ik}$ is the inter-item distance in the q-dimensional configuration and the $\hat{d}^{(q)}_{ik}$'s are fitted reference values that satisfy the ordering in (2), that is, values that correspond to a perfect match with the similarities $s_{ik}$.

General guidelines:

Stress    Goodness of fit
20%       Poor
10%       Fair
5%        Good
2.5%      Excellent
0%        Perfect
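To make the Stress formula concrete, the following is a minimal R sketch rather than the procedure used in the lecture. It rests on two loudly stated assumptions: the data are a hypothetical toy matrix, and the original distances stand in for the reference values $\hat{d}^{(q)}_{ik}$ (a metric simplification; Kruskal's nonmetric version obtains them by monotone regression). The helper name stress_q is made up for this illustration.

```r
# Hypothetical toy data: 5 items in 3 dimensions (not from the lecture).
x <- matrix(c(0, 0, 0,
              1, 0, 0,
              0, 2, 0,
              3, 1, 1,
              2, 2, 4), ncol = 3, byrow = TRUE)
d <- dist(x)   # original inter-item distances

# Stress-type measure for a q-dimensional classical MDS fit.
# Assumption: the original distances d play the role of the
# reference values d-hat in the Stress formula.
stress_q <- function(d, q) {
  config <- cmdscale(d, k = q)   # configuration in q dimensions
  dq <- dist(config)             # inter-item distances in that configuration
  sqrt(sum((dq - d)^2) / sum(dq^2))
}

# Evaluate the measure for q = 1, 2, 3.
s <- sapply(1:3, function(q) stress_q(d, q))
print(round(s, 4))
```

Because the toy points live in three dimensions, the value for q = 3 should be zero up to rounding, and the values should not increase as q grows. Comparing them with the guideline table gives a rough feel for how many dimensions are needed.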

Takane et al. (1977) proposed an alternative measure of closeness,

$$\mathrm{SStress} = \left[ \frac{\sum_{i<k} \left( d^2_{ik} - \hat{d}^2_{ik} \right)^2}{\sum_{i<k} d^4_{ik}} \right]^{1/2}.$$

The value of SStress is always between 0 and 1. Any value less than 0.1 is typically taken to mean that there is a good representation of the objects by the points in the given configuration.

Algorithm: See page 709 of the textbook.

1. Obtain the $N(N-1)/2$ similarities between distinct pairs of items. Order the similarities as in (1). Typically, distances are used.
2. Using a trial configuration in q dimensions, determine the inter-item distances $d^{(q)}_{ik}$ and the estimates $\hat{d}^{(q)}_{ik}$ so as to minimize the Stress or SStress measure.
3. Using the $\hat{d}^{(q)}_{ik}$'s, move the points around to obtain an improved configuration.
4. Choose the dimension q by checking the Stress or SStress measure.

Demonstration: In R, the multidimensional scaling command is cmdscale. It stands for classical multidimensional scaling.

setwd("c:/teaching/ama")
da=read.table("t12-4.dat")
dim(da)
[1] 22 9
x=da[,1:8]
s1=sqrt(diag(var(x)))
print(s1,digits=3)
(output omitted: the standard deviations of V1 to V8)
sinv=diag(1/s1)
y=as.matrix(x)%*%sinv
dd=dist(y,method="euclidean")
help(cmdscale)
mm=cmdscale(dd,4)   # use 4 dimensions to fit the data (q=4)
print(mm,digits=3)
(output omitted: a 22 x 4 matrix of coordinates)

m2=cmdscale(dd,2)   # Use 2-dimensional fit.
plot(m2[,1],m2[,2])
text(m2[,1],m2[,2],c(1:22))

** Another example: Airline distance

da=read.table("t12-7m.dat",header=TRUE)
da
(output omitted: a distance matrix with columns Ata, Bos, Cin, Col, Dal, Ind, LRk, LAs, Mem, StL, Spo, Tpa)

d1=as.dist(da)
d1
(output omitted: the lower triangle of the airline distance matrix; for example, the Ata-Bos distance is 1068)
mc=cmdscale(d1,2)
plot(mc[,1],mc[,2])   # See the figure on page 711.
text(mc[,1],mc[,2],colnames(da))
plot(-mc[,1],-mc[,2])   # To match precisely with the textbook figure.
text(-mc[,1],-mc[,2],colnames(da))

*** A third example ****

da=read.table("t12-9.dat")
dim(da)
[1] 25 7
x=da[,2:7]
x
(output omitted: a 25 x 6 data matrix with columns V2 to V7)

s1=sqrt(diag(var(x)))
ainv=diag(1/s1)
y=as.matrix(x)%*%ainv   # scale each variable by its standard deviation
d1=dist(y,method="euclidean")
md=cmdscale(d1,2,eig=TRUE)
names(md)
[1] "points" "eig" "x" "ac" "GOF"
plot(-md$points[,1],-md$points[,2])   # See the figure on page 715.
text(-md$points[,1],-md$points[,2],da[,1])
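The GOF component returned by cmdscale when eig=TRUE offers one way to approach Step 4 of the algorithm, the choice of q. The sketch below is an illustration only: the data are simulated rather than taken from the textbook files, and GOF is an eigenvalue-based goodness-of-fit measure for classical scaling, not the Stress or SStress measure defined earlier, so it should be read as a complementary diagnostic.

```r
# Hypothetical toy data: 6 items measured on 4 variables (simulated).
set.seed(1)
x <- matrix(rnorm(24), nrow = 6)
d <- dist(scale(x))   # Euclidean distances on standardized variables

# First goodness-of-fit value of classical MDS for q = 1, ..., 4.
# GOF[1] is the sum of the q largest eigenvalues divided by the
# sum of the absolute values of all eigenvalues.
gof <- sapply(1:4, function(q) cmdscale(d, k = q, eig = TRUE)$GOF[1])
print(round(gof, 3))
```

With four variables, the fit at q = 4 should reproduce the distances essentially exactly, so the last value should be 1 up to rounding. A small q at which the value is already close to 1 suggests that a low-dimensional plot loses little information.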


More information

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr Foundation of Data Mining i Topic: Data CMSC 49D/69D CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Data Data types Quality of data

More information

Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms

Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms Alberto Fernández and Sergio Gómez arxiv:cs/0608049v2 [cs.ir] 0 Jun 2009 Departament d Enginyeria Informàtica i Matemàtiques,

More information

2. Sample representativeness. That means some type of probability/random sampling.

2. Sample representativeness. That means some type of probability/random sampling. 1 Neuendorf Cluster Analysis Assumes: 1. Actually, any level of measurement (nominal, ordinal, interval/ratio) is accetable for certain tyes of clustering. The tyical methods, though, require metric (I/R)

More information

Multivariate analysis of genetic data: exploring groups diversity

Multivariate analysis of genetic data: exploring groups diversity Multivariate analysis of genetic data: exploring groups diversity T. Jombart Imperial College London Bogota 01-12-2010 1/42 Outline Introduction Clustering algorithms Hierarchical clustering K-means Multivariate

More information

Dimensionality of Hierarchical

Dimensionality of Hierarchical Dimensionality of Hierarchical and Proximal Data Structures David J. Krus and Patricia H. Krus Arizona State University The coefficient of correlation is a fairly general measure which subsumes other,

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

Machine Learning for Data Science (CS4786) Lecture 8

Machine Learning for Data Science (CS4786) Lecture 8 Machine Learning for Data Science (CS4786) Lecture 8 Clustering Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016fa/ Announcement Those of you who submitted HW1 and are still on waitlist email

More information

Evaluating Goodness of Fit in

Evaluating Goodness of Fit in Evaluating Goodness of Fit in Nonmetric Multidimensional Scaling by ALSCAL Robert MacCallum The Ohio State University Two types of information are provided to aid users of ALSCAL in evaluating goodness

More information

Lecture 2: Data Analytics of Narrative

Lecture 2: Data Analytics of Narrative Lecture 2: Data Analytics of Narrative Data Analytics of Narrative: Pattern Recognition in Text, and Text Synthesis, Supported by the Correspondence Analysis Platform. This Lecture is presented in three

More information

Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis

Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis Chris Funk Lecture 5 Topic Overview 1) Introduction/Unvariate Statistics 2) Bootstrapping/Monte Carlo Simulation/Kernel

More information

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012 Revision: Chapter 1-6 Applied Multivariate Statistics Spring 2012 Overview Cov, Cor, Mahalanobis, MV normal distribution Visualization: Stars plot, mosaic plot with shading Outlier: chisq.plot Missing

More information

2/19/2018. Dataset: 85,122 islands 19,392 > 1km 2 17,883 with data

2/19/2018. Dataset: 85,122 islands 19,392 > 1km 2 17,883 with data The group numbers are arbitrary. Remember that you can rotate dendrograms around any node and not change the meaning. So, the order of the clusters is not meaningful. Taking a subset of the data changes

More information

Data Preprocessing. Cluster Similarity

Data Preprocessing. Cluster Similarity 1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M

More information

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data Elena Baralis and Tania Cerquitelli Politecnico di Torino Data set types Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures Ordered Spatial Data Temporal Data Sequential

More information

Last time: PCA. Statistical Data Mining and Machine Learning Hilary Term Singular Value Decomposition (SVD) Eigendecomposition and PCA

Last time: PCA. Statistical Data Mining and Machine Learning Hilary Term Singular Value Decomposition (SVD) Eigendecomposition and PCA Last time: PCA Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar 10 What is Data? Collection of data objects and their attributes Attributes An attribute is a property

More information

Cluster Analysis CHAPTER PREVIEW KEY TERMS

Cluster Analysis CHAPTER PREVIEW KEY TERMS LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: Define cluster analysis, its roles, and its limitations. Identify the types of research questions addressed by

More information

ECON 214 Elements of Statistics for Economists

ECON 214 Elements of Statistics for Economists ECON 214 Elements of Statistics for Economists Session 8 Sampling Distributions Lecturer: Dr. Bernardin Senadza, Dept. of Economics Contact Information: bsenadza@ug.edu.gh College of Education School of

More information

Contents. Preface to the second edition. Preface to the fírst edition. Acknowledgments PART I PRELIMINARIES

Contents. Preface to the second edition. Preface to the fírst edition. Acknowledgments PART I PRELIMINARIES Contents Foreword Preface to the second edition Preface to the fírst edition Acknowledgments xvll xix xxi xxiii PART I PRELIMINARIES CHAPTER 1 Introduction 3 1.1 What Is Data Mining? 3 1.2 Where Is Data

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

The Power of Asymmetry in Binary Hashing

The Power of Asymmetry in Binary Hashing The Power of Asymmetry in Binary Hashing Behnam Neyshabur Yury Makarychev Toyota Technological Institute at Chicago Russ Salakhutdinov University of Toronto Nati Srebro Technion/TTIC Search by Image Image

More information

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining Outline Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining Dimensionality Reduction Introduction Principal Components Analysis Singular Value Decomposition Multidimensional

More information

MULTIVARIATE ANALYSIS OF BORE HOLE DISCONTINUITY DATA

MULTIVARIATE ANALYSIS OF BORE HOLE DISCONTINUITY DATA Maerz,. H., and Zhou, W., 999. Multivariate analysis of bore hole discontinuity data. Rock Mechanics for Industry, Proceedings of the 37th US Rock Mechanics Symposium, Vail Colorado, June 6-9, 999, v.,

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Small vs. large parsimony A quick review Fitch s algorithm:

More information

Advanced Machine Learning & Perception

Advanced Machine Learning & Perception Advanced Machine Learning & Perception Instructor: Tony Jebara Topic 2 Nonlinear Manifold Learning Multidimensional Scaling (MDS) Locally Linear Embedding (LLE) Beyond Principal Components Analysis (PCA)

More information

Clustering. Stephen Scott. CSCE 478/878 Lecture 8: Clustering. Stephen Scott. Introduction. Outline. Clustering.

Clustering. Stephen Scott. CSCE 478/878 Lecture 8: Clustering. Stephen Scott. Introduction. Outline. Clustering. 1 / 19 sscott@cse.unl.edu x1 If no label information is available, can still perform unsupervised learning Looking for structural information about instance space instead of label prediction function Approaches:

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Types of data sets Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures

More information

Clustering: K-means. -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD

Clustering: K-means. -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD Clustering: K-means -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 Clustering Introduction When clustering, we seek to simplify the data via a small(er) number of summarizing variables

More information

MAT 2037 LINEAR ALGEBRA I web:

MAT 2037 LINEAR ALGEBRA I web: MAT 237 LINEAR ALGEBRA I 2625 Dokuz Eylül University, Faculty of Science, Department of Mathematics web: Instructor: Engin Mermut http://kisideuedutr/enginmermut/ HOMEWORK 2 MATRIX ALGEBRA Textbook: Linear

More information

CSE446: non-parametric methods Spring 2017

CSE446: non-parametric methods Spring 2017 CSE446: non-parametric methods Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Linear Regression: What can go wrong? What do we do if the bias is too strong? Might want

More information

Module 7-2 Decomposition Approach

Module 7-2 Decomposition Approach Module 7-2 Decomposition Approach Chanan Singh Texas A&M University Decomposition Approach l Now we will describe a method of decomposing the state space into subsets for the purpose of calculating the

More information

luster Analysis F Murtagh 1 Cluster Analysis

luster Analysis F Murtagh 1 Cluster Analysis luster Analysis F Murtagh 1 Cluster Analysis Topics: Example: globular cluster study (PCA and clustering) Metric and distance Hierarchical agglomerative clustering Single link, minimum variance criterion

More information

CS626 Data Analysis and Simulation

CS626 Data Analysis and Simulation CS626 Data Analysis and Simulation Instructor: Peter Kemper R 104A, phone 221-3462, email:kemper@cs.wm.edu Today: Data Analysis: A Summary Reference: Berthold, Borgelt, Hoeppner, Klawonn: Guide to Intelligent

More information