Cluster Validity. Oct. 28, Cluster Validity 10/14/ Erin Wirch & Wenbo Wang. Outline. Hypothesis Testing. Relative Criteria.

Similar documents
Hypothesis Testing hypothesis testing approach

Dr. Maddah ENMG 617 EM Statistics 10/12/12. Nonparametric Statistics (Chapter 16, Hines)

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Data Mining and Analysis: Fundamental Concepts and Algorithms

Systems Simulation Chapter 7: Random-Number Generation

Visual interpretation with normal approximation

Master s Written Examination - Solution

Applications of Information Geometry to Hypothesis Testing and Signal Detection

Direction: This test is worth 250 points and each problem worth points. DO ANY SIX

Summary of Chapters 7-9

Module Master Recherche Apprentissage et Fouille

Statistical Inference

A Bayesian Criterion for Clustering Stability

LINEAR SYSTEMS (11) Intensive Computation

Detection Theory. Chapter 3. Statistical Decision Theory I. Isael Diaz Oct 26th 2010

Search Algorithms. Analysis of Algorithms. John Reif, Ph.D. Prepared by

Partitioning the Parameter Space. Topic 18 Composite Hypotheses

Slides 3: Random Numbers

An Introduction to Multivariate Statistical Analysis

Nonparametric hypothesis tests and permutation tests

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Lecture 7. Union bound for reducing M-ary to binary hypothesis testing

TUTORIAL 8 SOLUTIONS #

Hypothesis testing (cont d)

8: Hypothesis Testing

Basic Concepts in Matrix Algebra

f (1 0.5)/n Z =

BIOS 312: Precision of Statistical Inference

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test.

Performance Evaluation and Comparison

Computational functional genomics

Choice-Based Revenue Management: An Empirical Study of Estimation and Optimization. Appendix. Brief description of maximum likelihood estimation

Central Limit Theorem ( 5.3)

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger

2008 Winton. Statistical Testing of RNGs

Regularized Discriminant Analysis and Reduced-Rank LDA

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

B.N.Bandodkar College of Science, Thane. Random-Number Generation. Mrs M.J.Gholba

Data Preprocessing. Cluster Similarity

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series

EEL 5544 Noise in Linear Systems Lecture 30. X (s) = E [ e sx] f X (x)e sx dx. Moments can be found from the Laplace transform as

MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F.

Heuristics for The Whitehead Minimization Problem

2.830J / 6.780J / ESD.63J Control of Manufacturing Processes (SMA 6303) Spring 2008

Hypothesis Testing. BS2 Statistical Inference, Lecture 11 Michaelmas Term Steffen Lauritzen, University of Oxford; November 15, 2004

Machine Learning, Fall 2012 Homework 2

L11: Pattern recognition principles

[y i α βx i ] 2 (2) Q = i=1

Chapter 12: Estimation

Statistical Pattern Recognition

Unsupervised machine learning

Chapter 14. Linear least squares

MIXTURE MODELS AND EM

Notes on Linear Algebra and Matrix Theory

The One-Way Repeated-Measures ANOVA. (For Within-Subjects Designs)

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

(a) (3 points) Construct a 95% confidence interval for β 2 in Equation 1.

Multivariate Statistics: Hierarchical and k-means cluster analysis

Hypothesis Testing for Var-Cov Components

Master s Written Examination

Experimental Design and Data Analysis for Biologists

Clustering. Stephen Scott. CSCE 478/878 Lecture 8: Clustering. Stephen Scott. Introduction. Outline. Clustering.

UNIT 5:Random number generation And Variation Generation

Statistical Pattern Recognition

Class 24. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

Introduction to Statistical Inference

Statistical Inference. Hypothesis Testing

Spring 2012 Math 541B Exam 1

Wiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R.

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Computational Systems Biology: Biology X

Machine Learning (BSMC-GA 4439) Wenke Liu

The legacy of Sir Ronald A. Fisher. Fisher s three fundamental principles: local control, replication, and randomization.

Testing a Normal Covariance Matrix for Small Samples with Monotone Missing Data

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary

Session 3 The proportional odds model and the Mann-Whitney test

Chapter 7. Hypothesis Testing

Ch. 5 Hypothesis Testing

Math 1060 Linear Algebra Homework Exercises 1 1. Find the complete solutions (if any!) to each of the following systems of simultaneous equations:

Introduction to Probability and Stocastic Processes - Part I

Algorithmisches Lernen/Machine Learning

MATH5745 Multivariate Methods Lecture 07

Ep Matrices and Its Weighted Generalized Inverse

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Data Mining and Analysis: Fundamental Concepts and Algorithms

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Conditioning Nonparametric null hypotheses Permutation testing. Permutation tests. Patrick Breheny. October 5. STA 621: Nonparametric Statistics

Secretary Problems. Petropanagiotaki Maria. January MPLA, Algorithms & Complexity 2

On Improving the k-means Algorithm to Classify Unclassified Patterns

Hypothesis testing in multilevel models with block circular covariance structures

Prepared by: M. S. KumarSwamy, TGT(Maths) Page

1 Basic Concept and Similarity Measures

CSCI 1951-G Optimization Methods in Finance Part 10: Conic Optimization

Lecture 7: Hypothesis Testing and ANOVA

I = i 0,

Minimum Error Rate Classification

Transcription:

1 Testing Oct. 28, 2010

2 Testing Testing

Agenda 3 Testing Review of Testing Testing

Review of Testing 4 Test a parameter against a specific value Begin with H 0 and H 1 as the null and alternative hypotheses Power function: W (θ) = P(qɛD p θɛθ 1 ) Testing Goal: make correct decision

Testing in 5 Test whether the data of X possess a random structure First step: generate data to model a random structure Second step: define a statistic and compare results from our data set and a reference set Three methods exist to generate the population under the randomness hypothesis Choose best method for the situation Testing

6 Suitable for ratio data Requirement: All the arrangements of N vectors in a specific region of the l-dimensional space are equally likely to occur. This can be accomplished with random insertion of points in the region according to uniform distribution Can be used with internal or external criteria Testing

External Criteria 7 Impose clustering algorithm on X a priori based on intuitions Evaluate resulting clustering structure in terms of independently drawn structure Testing

Internal Criteria 8 Evaluate clustering structure in terms of vectors in X Example: proximity matrix Testing

9 Suitable when only internal information is available Definition: NxN matrix A as symmetric matrix with zero diagonal elements A(i,j) only gives information about dissimilarity between x i and x j Thus comparing dissimilarities is meaningless Testing

, cont d 10 Let A i be an NxN rank order without ties Reference population consists of matrices A i generated by randomly iserted integers in the range [1, N(N 1) 2 ] H 0 rejected if q is too large or too small Testing

11 Consider all possible partitions, P of x in m groups Assume that all possible mappings are equally likely Statistic q can be defined to measure degree information in X matches specific partition Use q to test degree of match between P and P against q i s corresponding to random partitions H 0 rejected if q is too large or too small Testing

12 To choose the best parameters A for a specific clustering algorithm to best fit the data set X Parameter set A the cluster size estimation m the initial estimates of parameter vectors related with each cluster Testing

Method I cluster size m is not pre-determined in the algorithm criteria: the clustering structure is captured by a wide range of A 13 Testing Figure: (a) 2-D clusters from normal distributions with mean [0, 0] T, [8, 4] T and [8, 0] T, covariance matrices 1.5I. (b) clustering result (cluster size m) with binary morphology algorithm, with respect of different resolution parameters r

Method I (Cont ) Comparing by data set with a wider range of variance: 14 Testing Figure: (a) 2-D clusters from normal distributions with mean [0, 0] T, [8, 4] T and [8, 0] T, covariance matrices 2.5I. (b) clustering result (cluster size m) with binary morphology algorithm, with respect of different resolution parameters r

Method II 15 cluster size m is pre-determined in the algorithm criteria: to choose the best clustering index q in the range of [m min, m max ] if q shows no trends with respect of m, vary parameter A for each m, choose the best A if q shows trends with respect of m, choose m where significant local change of q happens Testing

Method II (cont ) 16 Testing Figure: data set generated from 4 well-separated normal distributions (feature size l {2, 4, 6, 8}) (a) N = 50 (b) N = 100 (c) N = 150 (d) N = 200. The sharp turns indicate the clustering structure

Method II (cont ) 17 Testing Figure: data set generated from 4 poorly-separated uniformed distributions (feature size l {2, 4, 6, 8}) (a) N = 50 (b) N = 100 (c) N = 150 (d) N = 200. No sharp turn exhibited

Indices The modified Hubert Γ statistic: correlation between proximity matrix P and cluster distance matrix Q P(i, j) = d(xi, x j ), Q(i, j) = d(c xi, c xj ) N 1 Γ = (1/M) N i=1 j=i+1 The Dunn and Dunn-like indices dissimilarity function between two clusters: d(c i, C j ) = min x Ci,y C j d(x, y) diameter of a cluster C: diam(c) = max x,y C d(x, y) Dunn index: D m = min min i=1,...,m j=i+1,...,m X (i, j)y (i, j) (1) d(c i, C j ) max k=1,...,m diam(c k ) (2) 18 Testing

Indices (Cont ) The Davies-Bouldin(DB) and DB-like indices: s i is the measure of the spread around its mean vector for cluster C i dissimilarity function between two clusters: d(ci, C j ) the similarity index Rij between C i, C j has the property: if s j > s k and d ij = d ik then R ij > R ik if s j = s k and d ij < d ik then R ij > R ik choose Rij = s i +s j d ij, R i = max j=1,..m,j i R ij DB m = 1 m m R i (3) i=1 19 Testing The DB-like indices based on MST R ij = smst i DB MST m +s MST j d ij = 1 m m i=1 RMST i

Indices (Cont ) The silhouette index ai is average distance between x i and the rest elements of the cluster C i a i = d ps avg (x i, C x i ) (4) bi is average distance between x i and its closest cluster C k b i = min davg ps (x i, C k ) (5) k=1,...,m,k C i 20 Testing the silhouette width of x i s i = b i a i max(b i, a i ) S j = 1 n j i:x i C j s i, S m = 1 m m j S j (6)

Indices (Cont ) 21 The Gap indices: sum of distance between all pairs within the same cluster: D q = d(x i, x j ) (7) x j C q Wm = m q=i 1 2n q D q x i C q for each m, n data set X r m, r = 1,..., n are generated, the estimated size of cluster is obtained by maximizing: Testing Gap n (m) = E n (log(w r m)) log(w m ) (8)

Indices (Cont ) Information theory based criteria: criteria function C(θ, K) = 2L(θ) + φ(k) (9) L(θ) is the log-likelihood function K is the order of the model - dimentionality of θ, φ is an increasing function of K K is strictly increasing function of m 22 Testing K(m, l) = (l + l(l + 1) 2 + 1)m 1; (10) the goal is to minimize C with respect to θ and K

References 23 S. Theodoridis and K. Koutroumbas. (2009). Pattern Recognition (4th edition), Academic Press. Testing

24 Testing