Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques, 3rd ed. Chapter 10

Evaluation of Clustering

Clustering evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method.

Three major tasks of clustering evaluation:
Assessing the clustering tendency: whether a non-random structure exists in the data
Determining the number of clusters in a data set
Measuring clustering quality

Assessing Clustering Tendency

Clustering requires a nonuniform distribution of data. Assess whether non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution.

Hopkins statistic: a test of spatial randomness.
This statistic examines whether objects in a data set differ significantly from the assumption that they are uniformly distributed in the multidimensional space.
It compares the distances between the real objects and their nearest neighbors to the distances between artificial objects, uniformly generated over the data space, and their nearest real neighbors.
Given a dataset $D$, regarded as a sample of a random variable $o$, determine how far away $o$ is from being uniformly distributed in the data space.

Hopkins Statistic Index

Calculate the Hopkins statistic index:
Sample $n$ points $p_1, \dots, p_n$ uniformly from the data space of $D$. Each point has the same probability of being included in the sample. For each $p_i$, find its nearest neighbor in $D$: $x_i = \min\{dist(p_i, v)\}$ where $v \in D$.
Sample $n$ points $q_1, \dots, q_n$ uniformly from $D$. For each $q_i$, find its nearest neighbor in $D - \{q_i\}$: $y_i = \min\{dist(q_i, v)\}$ where $v \in D$ and $v \ne q_i$.
Calculate the Hopkins statistic:

\[ H = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i} \]

If $D$ is uniformly distributed, $\sum x_i$ and $\sum y_i$ will be close to each other and $H$ is close to 0.5.
If clusters are present, the distances for artificial objects ($x_i$) will be larger than for the real ones ($y_i$), because the artificial objects are homogeneously distributed whereas the real ones are grouped together. The value of $H$ then increases toward 1.
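The steps above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the "data space" is taken to be the axis-aligned bounding box of the data, points are tuples of coordinates, and the function names `nearest_distance` and `hopkins_statistic` are hypothetical, not from the slides.

```python
import math
import random

def nearest_distance(point, data, skip_index=None):
    """Distance from `point` to its nearest neighbor in `data`.

    `skip_index` excludes one object, so a real object is not matched to itself.
    """
    return min(
        math.dist(point, other)
        for idx, other in enumerate(data)
        if idx != skip_index
    )

def hopkins_statistic(data, n_samples, seed=0):
    """H = sum(x) / (sum(x) + sum(y)); requires n_samples <= len(data)."""
    rng = random.Random(seed)
    dims = len(data[0])
    # Assumption: the data space is the bounding box of the observed objects.
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # x_i: distances from artificial, uniformly generated points to the data.
    x = []
    for _ in range(n_samples):
        p = tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        x.append(nearest_distance(p, data))

    # y_i: distances from sampled real objects to their nearest other object.
    y = []
    for idx in rng.sample(range(len(data)), n_samples):
        y.append(nearest_distance(data[idx], data, skip_index=idx))

    return sum(x) / (sum(x) + sum(y))
```

On strongly clustered data the artificial points tend to fall in empty space, so the `x` sum dominates and the returned value moves toward 1.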

Examples

[Figure: two example data sets, (a) H value = 0.49 and (b) H value = 0.73. Open circles represent real objects, closed circles selected real objects, and asterisks artificial objects generated over the data space.]

Determine the Number of Clusters (1)

Empirical method
# of clusters: $k \approx \sqrt{n/2}$ for a dataset of $n$ points, e.g., $n = 200$, $k = 10$.

Elbow method
Given a number $k > 0$, we can form $k$ clusters on the data set using a clustering algorithm like k-means.
Calculate the sum of within-cluster variance, $var(k)$.
Plot the curve of $var$ with respect to $k$.
The first turning point in the curve suggests the right number.
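Both heuristics can be tried out with a short sketch. The k-means routine below is a plain Lloyd's iteration written for this example, and the within-cluster sum of squared distances is used as a stand-in for $var(k)$. The names `empirical_k`, `kmeans`, `within_cluster_ss`, and `elbow_curve` are assumptions made for illustration.

```python
import math
import random

def empirical_k(n):
    """Rule of thumb: k is roughly sqrt(n / 2)."""
    return round(math.sqrt(n / 2))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a list of coordinate tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def within_cluster_ss(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

def elbow_curve(points, max_k):
    """Map each k in 1..max_k to its within-cluster sum of squares."""
    return {k: within_cluster_ss(points, kmeans(points, k)) for k in range(1, max_k + 1)}
```

Plotting the values of `elbow_curve(points, max_k)` against `k` and looking for the first sharp bend corresponds to the elbow method. Because the k-means initialization is random, the curve may vary a little between seeds.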

Determine the Number of Clusters (2)

Cross-validation method
Divide a given data set into $m$ parts.
Use $m - 1$ parts to obtain a clustering model.
Use the remaining part to test the quality of the clustering.
E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and the closest centroids to measure how well the model fits the test set.
For any $k > 0$, repeat it $m$ times and calculate the average quality measure as the overall quality measure.
Compare the overall quality measure w.r.t. different values of $k$, and find the # of clusters that fits the data the best.

Measuring Clustering Quality

External: supervised, employs criteria not inherent to the dataset.
Compare a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a certain clustering quality measure.

Internal: unsupervised, criteria derived from the data itself.
Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient.
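The cross-validation method for choosing the number of clusters can be sketched as follows. This is a minimal illustration, assuming k-means as the clustering model and the sum of squared distances to the closest centroid as the quality measure. The names `kmeans` and `cv_quality` and the default `m=5` are hypothetical choices for this example.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a list of coordinate tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def cv_quality(points, k, m=5, seed=0):
    """Average, over m folds, of the test-fold sum of squared distances
    to the closest centroid learned from the other m - 1 folds."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    folds = [shuffled[i::m] for i in range(m)]  # m roughly equal parts
    total = 0.0
    for i in range(m):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        centroids = kmeans(train, k, seed=seed)
        total += sum(min(math.dist(p, c) for c in centroids) ** 2 for p in folds[i])
    return total / m
```

Calling `cv_quality(points, k)` for several candidate values of `k` gives the overall quality measures to compare. Lower values mean the centroids fit the held-out points more closely.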

Measuring Clustering Quality: External Methods

Clustering quality measure: $Q(C, T)$, for a clustering $C$ given the ground truth $T$.
$Q$ is good if it satisfies the following 4 essential criteria:
Cluster homogeneity: the purer, the better.
Cluster completeness: should assign objects belonging to the same category in the ground truth to the same cluster.
Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category).
Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces.

BCubed Precision and Recall Metrics

The precision of an object indicates how many other objects in the same cluster belong to the same category as the object.
The recall of an object reflects how many objects of the same category are assigned to the same cluster.

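BCubed Precision and Recall Metrics

Let $D = \{o_1, \dots, o_n\}$ be the set of objects and $C$ be a clustering on $D$. Let $L(o_i)$ be the category of $o_i$ given by the ground truth and $C(o_i)$ be the cluster_ID of $o_i$ in $C$.

For two objects $o_i$ and $o_j$ ($1 \le i, j \le n$, $i \ne j$), the correctness of the relation between $o_i$ and $o_j$ in clustering $C$ is given by

\[ Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases} \]

BCubed precision is defined as

\[ \text{Precision BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j:\ i \ne j,\ C(o_i) = C(o_j)} Correctness(o_i, o_j)}{|\{o_j \mid i \ne j,\ C(o_i) = C(o_j)\}|} \]

BCubed recall is defined as

\[ \text{Recall BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j:\ i \ne j,\ L(o_i) = L(o_j)} Correctness(o_i, o_j)}{|\{o_j \mid i \ne j,\ L(o_i) = L(o_j)\}|} \]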
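A direct, quadratic-time sketch of the BCubed metrics is given below, assuming the standard formulation in which each object's score is averaged over its peers ($i \ne j$). The function names `correctness` and `bcubed` are hypothetical. One further assumption, flagged in the code: an object with no peers in its cluster (or category) contributes 0 to the corresponding average, a case the slides do not cover.

```python
def correctness(labels, clusters, i, j):
    """1 if 'same category' and 'same cluster' agree for objects i and j, else 0."""
    return int((labels[i] == labels[j]) == (clusters[i] == clusters[j]))

def bcubed(labels, clusters):
    """Return (precision, recall) for ground-truth `labels` and cluster ids `clusters`."""
    n = len(labels)
    precision = 0.0
    recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and clusters[j] == clusters[i]]
        same_category = [j for j in range(n) if j != i and labels[j] == labels[i]]
        # Assumption: objects without peers add 0 to the running sums.
        if same_cluster:
            hits = sum(correctness(labels, clusters, i, j) for j in same_cluster)
            precision += hits / len(same_cluster)
        if same_category:
            hits = sum(correctness(labels, clusters, i, j) for j in same_category)
            recall += hits / len(same_category)
    return precision / n, recall / n
```

For example, merging two categories of two objects each into a single cluster keeps recall at 1 but lowers precision, since only one of each object's three cluster peers shares its category.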

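Intrinsic Methods (1)

Intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. They apply when the ground truth is not available.

The silhouette coefficient measure
For a data set $D$ with $n$ objects, $D$ is partitioned into $k$ clusters $C_1, C_2, \dots, C_k$.
For each object $o \in D$, we calculate $a(o)$, the average distance between $o$ and all other objects in the cluster to which $o$ belongs. Suppose $o \in C_i$ ($1 \le i \le k$):

\[ a(o) = \frac{\sum_{o' \in C_i,\ o' \ne o} dist(o, o')}{|C_i| - 1} \]

Intrinsic Methods (2)

Similarly, we calculate $b(o)$, the minimum average distance from $o$ to all clusters to which $o$ does not belong:

\[ b(o) = \min_{C_j:\ 1 \le j \le k,\ j \ne i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\} \]

The silhouette coefficient of $o$ is defined as

\[ s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}} \]

The value of the silhouette coefficient is between -1 and 1.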
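The definitions of $a(o)$, $b(o)$, and $s(o)$ translate almost line by line into code. The sketch below assumes Euclidean distance, points given as coordinate tuples, at least two clusters, and not all points identical. The function name `silhouette` is hypothetical, and the score of 0 for an object alone in its cluster is an assumption, since $a(o)$ is undefined in that case.

```python
import math

def silhouette(points, assignment):
    """Return the silhouette coefficient s(o) of every object, in input order.

    `assignment[i]` is the cluster id of `points[i]`.
    """
    clusters = {}
    for idx, cluster_id in enumerate(assignment):
        clusters.setdefault(cluster_id, []).append(idx)

    scores = []
    for i, cluster_id in enumerate(assignment):
        own = [j for j in clusters[cluster_id] if j != i]
        if not own:
            # Assumption: a singleton cluster gets score 0 (a(o) is undefined).
            scores.append(0.0)
            continue
        # a(o): average distance to the other objects in o's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(o): smallest average distance to the objects of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other_id, members in clusters.items()
            if other_id != cluster_id
        )
        scores.append((b - a) / max(a, b))
    return scores
```

Averaging the returned scores over one cluster, or over the whole data set, gives a single number summarizing that cluster or the clustering as a whole.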

Intrinsic Methods (3)

$a(o)$ reflects the compactness of the cluster to which $o$ belongs. The smaller the value, the more compact the cluster.
$b(o)$ captures the degree to which $o$ is separated from other clusters.
When $s(o)$ approaches 1, the cluster containing $o$ is compact and $o$ is far away from other clusters, which is preferable.
To measure a cluster's fitness within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster.
To measure the quality of a clustering, compute the average silhouette coefficient value of all objects in the data set.

Summary

Cluster analysis groups objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
K-means and K-medoids algorithms are popular partitioning-based clustering algorithms.
Quality of clustering results can be evaluated in various ways.