Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining


Outline
Administrivia and Introduction: Course Structure; Syllabus; Introduction to Data Mining
Dimensionality Reduction: Introduction; Principal Components Analysis; Singular Value Decomposition; Multidimensional Scaling; Isomap
Clustering: Introduction; Hierarchical Clustering; K-means; Vector Quantisation; Probabilistic Methods

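Notation
Data consists of p measurements (variables/attributes) on n examples (observations/cases).
X is an n × p matrix with $X_{ij}$ := the j-th measurement for the i-th example:
\[
X = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nj} & \cdots & X_{np}
\end{pmatrix}
\]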

Crabs Data (n = 200, p = 5)
Campbell (1974) studied rock crabs of the genus Leptograpsus. One species, L. variegatus, had been split into two new species, previously grouped by colour: orange and blue. Preserved specimens lose their colour, so it was hoped that morphological differences would enable museum material to be classified.
Data are available on 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia. Each specimen has five measurements, all in mm: the width of the frontal lip (FL), the rear width (RW), the length along the midline (CL) and the maximum width (CW) of the carapace, and the body depth (BD).

Crabs Data
The crabs dataset has n = 200 observations on p = 5 morphological features of crabs:
    FL  frontal lip size (mm)
    RW  rear width (mm)
    CL  carapace length (mm)
    CW  carapace width (mm)
    BD  body depth (mm)
Also available are the colour (B or O) and sex (M or F).

    ## load package MASS containing the data
    library(MASS)
    ## look at data
    crabs

      sp sex index  FL  RW   CL   CW  BD
    1  B   M     1 8.1 6.7 16.1 19.0 7.0
    2  B   M     2 8.8 7.7 18.1 20.8 7.4
    ...

R code
    ## assign predictor and class variables
    Crabs <- crabs[,4:8]
    Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))
    ## plot data using pair plots
    plot(Crabs,col=unclass(Crabs.class))
    ## boxplots
    boxplot(Crabs)
    ## parallel coordinates (parcoord comes from package MASS)
    parcoord(Crabs)

[Figure: Simple Pairwise Scatterplots of FL, RW, CL, CW and BD]

[Figure: Univariate Boxplots of FL, RW, CL, CW and BD]

[Figure: Univariate Histograms of Frontal Lobe Size (mm), Rear Width (mm), Carapace Length (mm), Carapace Width (mm) and Body Depth (mm); vertical axes show Frequency]

[Figure: Parallel Coordinate Plots over FL, RW, CL, CW and BD]

These summary plots are helpful for a handful of variables, but they stop being useful when the dimensionality of the data is high (a few dozen to thousands of variables).
Possible approaches for higher-dimensional problems:
- We are constrained to view data in 2 or 3 dimensions.
- Look for interesting projections of X into lower dimensions.
- Hope that for large p, considering only k ≪ p dimensions is just as informative.

Principal Components Analysis (PCA)
Seek to rotate the data to a new basis that represents the data in a more interesting way.
- PCA considers the interesting directions to be those with the greatest variance.
- It builds up an orthogonal basis in which each new basis vector is chosen to explain the greatest remaining variance in the data.
- The first few PCs should represent most of the variance-covariance structure in the data, i.e. the subspace spanned by the first k PCs represents the best k-dimensional view of the data.

Principal Components Analysis (PCA)
Consider a set of real-valued variables $X = (X_1, \ldots, X_p)^T$. For the 1st PC, we seek a derived variable of the form
\[
Z_1 = a_{11} X_1 + a_{21} X_2 + \cdots + a_{p1} X_p = X^T a_1,
\]
where the coefficients $a_1 = (a_{11}, \ldots, a_{p1})^T \in \mathbb{R}^p$ are chosen to maximise $\operatorname{var}(Z_1)$. To get a well-defined problem, we fix $a_1^T a_1 = 1$.
The 1st PC attempts to capture the common variation in all variables using a single derived variable.
The 2nd PC $Z_2$ is chosen to be orthogonal to the 1st and is computed in a similar way. It has the largest variance in the remaining $p - 1$ dimensions, and so on.

[Figure: Principal Components Analysis (PCA). Scatterplot of X2 against X1, followed by the same data in rotated coordinates, PC_2 against PC_1]

How to Obtain the Coefficients?
To find the 1st PC, given by $Z_1 = X^T a_1$:

Maximise $\operatorname{var}(Z_1) = \operatorname{var}(X^T a_1) = a_1^T \operatorname{cov}(X)\, a_1 \approx a_1^T S a_1$ subject to $a_1^T a_1 = 1$,

where $S = n^{-1} X^T X$ is the p × p sample covariance matrix of the centred n × p data matrix X.
Rewriting this as a constrained maximisation problem with Lagrange multiplier $\lambda_1$:

Maximise $F(a_1) = a_1^T S a_1 - \lambda_1 \left( a_1^T a_1 - 1 \right)$ w.r.t. $a_1$.

The corresponding vector of partial derivatives is
\[
\frac{\partial F}{\partial a_1} = 2 S a_1 - 2 \lambda_1 a_1 .
\]
Setting this to zero gives the eigenvector equation $S a_1 = \lambda_1 a_1$, i.e. $a_1$ must be an eigenvector of S and $\lambda_1$ the corresponding eigenvalue.
Since $a_1^T S a_1 = \lambda_1 a_1^T a_1 = \lambda_1$, the variance of $Z_1$ equals $\lambda_1$, so the coefficient vector of the 1st PC must be the eigenvector associated with the largest eigenvalue of S.
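The eigenvector characterisation above can be checked numerically. The following is a minimal sketch, assuming the crabs data from package MASS and the divisor-n covariance used in the text; the variable names (Xc, a1, lambda1, var_z1) are illustrative only.

```r
## Sketch: first principal component of the crabs measurements via eigen()
library(MASS)                                  # provides the crabs data frame

X  <- as.matrix(crabs[, 4:8])                  # FL, RW, CL, CW, BD
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
n  <- nrow(Xc)

S <- crossprod(Xc) / n                         # S = n^{-1} X^T X (divisor n)
e <- eigen(S, symmetric = TRUE)                # eigenvalues come back in decreasing order

a1      <- e$vectors[, 1]                      # coefficients of the 1st PC
lambda1 <- e$values[1]                         # largest eigenvalue of S

z1     <- drop(Xc %*% a1)                      # scores on the 1st PC
var_z1 <- mean(z1^2)                           # divisor-n variance (z1 has mean zero)

## unit-norm constraint, eigenvalue, and variance of the derived variable
c(norm = sum(a1^2), lambda1 = lambda1, var_z1 = var_z1)
```

The last line should show a unit norm and two matching values, which is the statement $a_1^T S a_1 = \lambda_1$ in numerical form.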

How to Obtain the Coefficients?
How about the 2nd PC? Proceed as before, but include the additional constraint that the 2nd PC must be orthogonal to the 1st PC:

Maximise $F(a_2) = a_2^T S a_2 - \lambda_2 \left( a_2^T a_2 - 1 \right) - \mu\, a_1^T a_2$ w.r.t. $a_2$.

Solving this shows that $a_2$ must be the eigenvector of S associated with the 2nd largest eigenvalue, and so on.
The eigenvalue decomposition of S is given by
\[
S = A \Lambda A^T ,
\]
where $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and A is a p × p orthogonal matrix whose columns are the p eigenvectors of S.
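The step "solving this shows" can be filled in as follows. This is a sketch of one standard argument, using only the symbols defined above and the fact that S is symmetric with $S a_1 = \lambda_1 a_1$.

```latex
\begin{align*}
\frac{\partial F}{\partial a_2}
  &= 2 S a_2 - 2 \lambda_2 a_2 - \mu a_1 = 0 . \\
\intertext{Premultiplying by $a_1^T$ and using $a_1^T S = \lambda_1 a_1^T$,
  $a_1^T a_2 = 0$ and $a_1^T a_1 = 1$:}
0 &= 2 a_1^T S a_2 - 2 \lambda_2 a_1^T a_2 - \mu a_1^T a_1
   = 2 \lambda_1 a_1^T a_2 - 0 - \mu
   = -\mu . \\
\intertext{Hence $\mu = 0$ and the stationarity condition reduces to}
S a_2 &= \lambda_2 a_2 , \qquad
a_2^T S a_2 = \lambda_2 a_2^T a_2 = \lambda_2 .
\end{align*}
```

So $a_2$ is again an eigenvector of S, and among eigenvectors orthogonal to $a_1$ the variance $\lambda_2$ is largest for the one with the 2nd largest eigenvalue.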

Properties of the Principal Components
PCs are uncorrelated:
\[
\operatorname{cov}(X^T a_i, X^T a_j) \approx a_i^T S a_j = 0 \quad \text{for } i \ne j .
\]
The total sample variance is given by
\[
\text{Total sample variance} = \sum_{i=1}^{p} s_{ii} = \lambda_1 + \cdots + \lambda_p ,
\]
so the proportion of total variance explained by the k-th PC is
\[
\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \qquad k = 1, 2, \ldots, p .
\]
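These properties can be illustrated on the crabs data. The sketch below assumes princomp (which uses divisor n) and package MASS; names such as lambda, total_var and score_cor are illustrative.

```r
## Sketch: uncorrelated scores and proportion of variance explained
library(MASS)                           # provides the crabs data frame

Crabs <- crabs[, 4:8]
n     <- nrow(Crabs)

pc     <- princomp(Crabs, cor = FALSE)  # covariance-based PCA
lambda <- pc$sdev^2                     # eigenvalues of S (divisor n)

S         <- cov(Crabs) * (n - 1) / n   # rescale cov() from divisor n-1 to divisor n
total_var <- sum(diag(S))               # sum of the s_ii

score_cor <- cor(pc$scores)             # correlation matrix of the PC scores

prop <- lambda / sum(lambda)            # proportion explained by each PC
round(cbind(lambda, prop, cumulative = cumsum(prop)), 4)
```

The off-diagonal entries of score_cor should be zero up to rounding error, and total_var should agree with sum(lambda).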

R code
This is what we have had before:
    library(MASS)
    Crabs <- crabs[,4:8]
    Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))
    plot(Crabs,col=unclass(Crabs.class))
Now perform PCA with the function princomp. Alternatively, use eigen or svd instead (later).
    Crabs.pca <- princomp(Crabs,cor=FALSE)
    ## variances of the principal components
    plot(Crabs.pca)
    ## pair plots of the rotated data (PC scores), coloured by class
    pairs(predict(Crabs.pca),col=unclass(Crabs.class))
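As a preview of the svd alternative mentioned above, the following sketch compares princomp with a singular value decomposition of the centred data matrix. It assumes package MASS for the data; sdev_svd and loading_gap are illustrative names, and loadings are compared in absolute value because eigenvectors are only defined up to sign.

```r
## Sketch: PCA of the crabs data via svd(), compared with princomp()
library(MASS)                                   # provides the crabs data frame

Crabs <- crabs[, 4:8]
Xc    <- scale(as.matrix(Crabs), center = TRUE, scale = FALSE)  # centred data
n     <- nrow(Xc)

sv <- svd(Xc)                                   # Xc = U D V^T
sdev_svd <- sv$d / sqrt(n)                      # d_k^2 / n are the eigenvalues of S

pc <- princomp(Crabs, cor = FALSE)

## columns of V are the PC coefficient vectors, up to a sign flip per column
loading_gap <- max(abs(abs(sv$v) - abs(unclass(pc$loadings))))

rbind(svd = sdev_svd, princomp = unname(pc$sdev))
```

The two rows printed at the end should agree, and loading_gap should be numerically zero.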

[Figure: PCA Example 1: Original crabs data. Pairwise scatterplots of FL, RW, CL, CW and BD]

[Figure: PCA Example 1: Rotated crabs data. Pairwise scatterplots of Comp.1 to Comp.5]

[Figure: PCA Example 1: Crabs Data (n = 200, p = 5). Scatterplot of Comp.3 against Comp.2]

PCA Example 2: Yeast Cell Cycle Data (n = 384, p = 17)
Cho et al (1998) present gene expression data on the cell cycle of yeast. They identify a subset of genes that can be categorised into five different phases of the cell cycle. Changes in expression for the genes are measured over two cell cycles (17 time points). The data were normalised so that the expression values for each gene have mean zero and unit variance across the cell cycles. We visualise the 384 genes in the space of the first two principal components.

[Figure: PCA Example 2: Yeast Cell Cycle Data (n = 384, p = 17). Scatterplot of Comp.2 against Comp.1]

Comments on the use of PCA
- PCA is commonly used to project data X onto the first k PCs, giving the best k-dimensional view of the data.
- PCA is commonly used for lossy compression of high-dimensional data.
- The emphasis on variance is where the weaknesses of PCA stem from:
  - The PCs depend heavily on the units of measurement. Where the data matrix contains measurements of vastly differing orders of magnitude, the PCs will be greatly biased in the direction of the larger measurements. It is therefore recommended to calculate PCs from cor(X) instead of cov(X).
  - Robustness to outliers is also an issue. Variance is affected by outliers, and therefore so are the PCs.
- Although PCs are uncorrelated, scatterplots sometimes reveal structures in the data other than linear correlation.
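The dependence on units of measurement can be seen on the USArrests data that ships with R, where Assault has a far larger variance than the other variables. This is a small sketch, not part of the original analysis; load_cov and load_cor are illustrative names.

```r
## Sketch: covariance-based versus correlation-based PCA on USArrests
vars <- apply(USArrests, 2, var)                # Assault dominates the raw variances

pc_cov <- princomp(USArrests, cor = FALSE)      # PCs from cov(X)
pc_cor <- princomp(USArrests, cor = TRUE)       # PCs from cor(X)

load_cov <- unclass(pc_cov$loadings)[, 1]       # 1st PC coefficients, raw units
load_cor <- unclass(pc_cor$loadings)[, 1]       # 1st PC coefficients, standardised

round(rbind(variance = vars, cov_PC1 = load_cov, cor_PC1 = load_cor), 3)
```

With cov(X) the 1st PC is almost entirely the Assault axis, whereas with cor(X) the weight is spread across the variables.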

Biplots
When viewing projections of the data matrix X into its PC space, it is instructive to view the contribution of the original variables to the PCs that are plotted.
- Biplots overlay the projection of the unit vectors of the original variables into the PC space.
- As PCs are linear combinations of the original variables, it is straightforward to invert this relationship to yield the contributions of the original variables to the PCs.

Biplots
Biplots show us an image of the data together with the unit vectors of the original axes, both projected into the PC space.
- The position of a projected point relative to the projected original axes indicates its location in terms of the original variables.
- The unit vectors of the original variables give a common scale on which to compare how much weight each PC gives to the original variables.
- It can be shown that cos θ (where θ is the angle between two projected original axes) approximates the correlation between these variables.
- However, the quality of this image depends on the proportion of variance explained by the PCs used.
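The cosine property can be explored numerically. The sketch below assumes one particular biplot scaling, in which each variable arrow is its loading vector multiplied by the square roots of the eigenvalues; under that assumption the cosines equal the correlations exactly when all p PCs are kept, and only approximately in two dimensions. It uses the USArrests data that ships with R, and the names G, G2 and cos2 are illustrative.

```r
## Sketch: angles between variable arrows versus correlations (USArrests)
R  <- cor(USArrests)
pc <- princomp(USArrests, cor = TRUE)           # correlation-based PCA

A      <- unclass(pc$loadings)                  # columns are eigenvectors of R
lambda <- pc$sdev^2                             # eigenvalues of R

G <- A %*% diag(sqrt(lambda))                   # assumed arrow scaling: A Lambda^(1/2)

## with all p PCs, G G^T reproduces R exactly
full_gap <- max(abs(tcrossprod(G) - R))

## keep only the first two PCs, as in a 2-dimensional biplot
G2   <- G[, 1:2]
len2 <- sqrt(rowSums(G2^2))                     # arrow lengths in the plane
cos2 <- tcrossprod(G2) / outer(len2, len2)      # cosines of angles between arrows

round(cos2, 2)                                  # compare with round(R, 2)
round(R, 2)
```

How closely cos2 tracks R depends on how much variance the first two PCs explain, which is the caveat in the last bullet above.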

Biplot Example 1: Fisher's Iris Data
50 samples from each of 3 species of iris: Iris setosa, versicolor, and virginica.
Each sample has measurements of the length and width of both sepals and petals.
Collected by E. Anderson (1935) and analysed by R.A. Fisher (1936).
Using again the functions princomp and biplot:
    iris1 <- iris
    ## drop the Species column
    iris1 <- iris1[,-5]
    biplot(princomp(iris1,cor=TRUE))

[Figure: Biplot of the iris data, Comp.2 against Comp.1, with observations labelled by row number and arrows for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]

Biplot Example 2: US Arrests
This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas.
    pairs(USArrests)
    USArrests.pca <- princomp(USArrests,cor=TRUE)
    plot(USArrests.pca)
    pairs(predict(USArrests.pca))
    biplot(USArrests.pca)

[Figure: Pairs Plot: US Arrests. Pairwise scatterplots of Murder, Assault, UrbanPop and Rape]

[Figure: Biplot Example 2: US Arrests. Comp.2 against Comp.1, with states labelled by name and arrows for Murder, Assault, Rape and UrbanPop]

Biplot Example 3: US State data
This data set contains statistics such as illiteracy and life expectancy for the 50 US states.
    data(state)                       ## load state data
    state <- state.x77[, 2:7]         ## extract useful info
    row.names(state) <- state.abb
    state[1:5,]                       ## let's have a look

       Income Illiteracy Life Exp Murder HS Grad Frost
    AL   3624        2.1    69.05   15.1    41.3    20
    AK   6315        1.5    69.31   11.3    66.7   152
    AZ   4530        1.8    70.55    7.8    58.1    15
    AR   3378        1.9    70.66   10.1    39.9    65
    CA   5114        1.1    71.71   10.3    62.6    20

    ## calculate the PCs of the data and show biplot
    state.pca <- princomp(state,cor=TRUE)
    biplot(state.pca,pc.biplot=TRUE,cex=0.8,font=2,expand=0.9)

[Figure: Biplot: US States. Comp.2 against Comp.1, with states labelled by abbreviation and arrows for Income, Illiteracy, Life Exp, Murder, HS Grad and Frost]