Louis Roussos Sports Data

1 Louis Roussos Sports Data
Rank the sports you most like to participate in, 1 = favorite, 7 = least favorite. There are n = 130 rank vectors.

    > sportsranks
    Baseball Football Basketball Tennis Cycling Swimming Jogging
    [...]

2 K-means in R
Set the number of clusters K with the centers argument. nstart is the number of times the algorithm is run, each time using a different random starting set of means.

    > kmeans(sportsranks,centers=2,nstart=10)
    K-means clustering with 2 clusters of sizes 62, 68
    Cluster means:
    Baseball Football Basketball Tennis Cycling Swimming Jogging
    [...]
    Clustering vector:
    [...]
    Within cluster sum of squares by cluster:
    [...]
    Available components:
    [1] cluster centers withinss size
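For readers without the sportsranks data, the same call can be tried on simulated rankings. This is only a sketch: the matrix ranks below is a hypothetical stand-in built from random permutations of 1 to 7, so the clusters it produces carry no meaning.

```r
set.seed(1)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")

# Hypothetical stand-in for sportsranks: 130 random rank vectors
ranks <- t(replicate(130, sample(7)))
colnames(ranks) <- sports

# Two clusters, best of 10 random starts
km <- kmeans(ranks, centers = 2, nstart = 10)

km$size               # cluster sizes
round(km$centers, 2)  # cluster means, one row per cluster
km$withinss           # within-cluster sum of squares, one per cluster
```

The components cluster, centers, withinss and size used here are the ones listed in the output above.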

3 Getting clusterings for K = 2, ..., 10

    kms <- vector("list",10)
    for(k in 2:10) {
      kms[[k]] <- kmeans(sportsranks,centers=k,nstart=10)
    }

4 Cluster means for K = 1, ..., 4
For each K, the table of cluster means has one row per group and the columns BaseB, FootB, BsktB, Ten, Cyc, Swim, Jog. [...]

K = 2: Group 1 likes swimming and cycling, while group 2 likes the team sports: baseball, football, and basketball.
K = 3: Group 1 appears to be about the same as the team-sports group from K = 2, while groups 2 and 3 both like swimming and cycling. The difference is that group 3 does not like jogging, while group 2 does.
K = 4: The team-sports group has split into one that likes tennis (group 3) and one that doesn't (group 2).

5 Plotting two clusters
The idea is to project the observations onto the subspace (which is just a line) that goes through the two cluster mean vectors. The vector

$$z = \frac{\mu_1 - \mu_2}{\|\mu_1 - \mu_2\|}$$

is the unit vector pointing from $\mu_2$ to $\mu_1$. Then, using z as an axis, the projections of the observations onto z have coordinates

$$w_i = x_i z', \quad i = 1, \ldots, N,$$

where $x_i$ and z are written as row vectors.
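The projection can be sketched in a few lines of R. The data matrix x is a hypothetical stand-in (random rank vectors), and the names mu1, mu2, z and w are chosen here to mirror the symbols in the text.

```r
set.seed(1)
# Hypothetical stand-in for sportsranks: 130 random rank vectors
x <- t(replicate(130, sample(7)))
km <- kmeans(x, centers = 2, nstart = 10)

mu1 <- km$centers[1, ]
mu2 <- km$centers[2, ]

# Unit vector pointing from mu2 to mu1
z <- (mu1 - mu2) / sqrt(sum((mu1 - mu2)^2))

# Coordinate of each observation along z
w <- as.vector(x %*% z)

# Histogram of the projections
hist(w, breaks = 20, xlab = "W", main = "K=2")
```

Because z points from the second mean to the first, observations in cluster 1 tend to have larger values of w than those in cluster 2.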

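6 The histogram, K=2
[Figure: histogram of the projections W, with Frequency on the vertical axis and the sports Basketball, Jogging, Football, Swimming, Baseball, Cycling, Tennis marked along the W axis.]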

7 Plot for K=3
If K = 3, then the three means lie in a plane, hence we would like to project the observations onto that plane. One approach is to use principal components on the means. With

$$Z = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix},$$

we apply the spectral decomposition to the sample covariance matrix of Z:

$$\frac{1}{3} Z' H_3 Z = G L G', \qquad (1)$$

where $H_3 = I_3 - \frac{1}{3} 1_3 1_3'$ is the centering matrix, G is orthogonal, and L is diagonal. The diagonals of L here are 11.77, 4.07, and five zeros. We then rotate the data and the means using G,

$$W = XG \quad \text{and} \quad W^{(\text{means})} = ZG.$$

Only the first two columns in each matrix are relevant.
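A minimal sketch of this rotation in R, assuming a hypothetical data matrix x of random rank vectors in place of sportsranks. The names Z, H3, G, W and Wmeans follow the symbols in the text; nothing here is taken from the original code.

```r
set.seed(1)
# Hypothetical stand-in for sportsranks: 130 random rank vectors
x <- t(replicate(130, sample(7)))
km <- kmeans(x, centers = 3, nstart = 10)

Z <- km$centers                       # 3 x 7 matrix of cluster means
H3 <- diag(3) - matrix(1 / 3, 3, 3)   # 3 x 3 centering matrix
S <- t(Z) %*% H3 %*% Z / 3            # sample covariance matrix of the means

eg <- eigen(S, symmetric = TRUE)      # spectral decomposition S = G L G'
G <- eg$vectors
round(eg$values, 4)                   # two positive values, then five zeros

W <- x %*% G                          # rotated observations
Wmeans <- Z %*% G                     # rotated means

# Only the first two columns matter
plot(W[, 1:2], col = km$cluster, xlab = "Var 1", ylab = "Var 2")
points(Wmeans[, 1:2], pch = 19, cex = 1.5)
```

Three points span at most a plane, which is why only two eigenvalues can be nonzero.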

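8 The plot, K=3
[Figure: scatter plot on the first two rotated variables, Var 1 (horizontal) and Var 2 (vertical), with labels for Jogging, Baseball, Football, Tennis, Basketball, Cycling, and Swimming.]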

9 The sums of squares $SS_K$

$$SS_K = \mathrm{obj}(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{\{i \,:\, y_i = k\}} \|x_i - \mu_k\|^2.$$
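As a quick check of the formula, the sum of squares can be computed by hand and compared with the tot.withinss component that kmeans returns. The data matrix x is again a hypothetical stand-in for sportsranks.

```r
set.seed(1)
# Hypothetical stand-in for sportsranks: 130 random rank vectors
x <- t(replicate(130, sample(7)))
km <- kmeans(x, centers = 3, nstart = 10)

# Row i of this matrix is the mean of the cluster that observation i belongs to
assigned_means <- km$centers[km$cluster, ]

# Sum over all observations of the squared distance to their own cluster mean
ss <- sum((x - assigned_means)^2)

all.equal(ss, km$tot.withinss)
```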

10 The reduction of sums of squares
[Figure: the relative reduction $1 - SS_K / SS_{K-1}$ plotted against K.]
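One way to produce such a plot is sketched below, under the assumption that the sum of squares for K = 1 is the total sum of squares about the overall mean. The data matrix x is a hypothetical stand-in for sportsranks, and ss and reduction are names chosen for this illustration.

```r
set.seed(1)
# Hypothetical stand-in for sportsranks: 130 random rank vectors
x <- t(replicate(130, sample(7)))

ss <- numeric(10)
# K = 1: every observation is compared with the overall mean
ss[1] <- sum(scale(x, scale = FALSE)^2)
for (k in 2:10) {
  ss[k] <- kmeans(x, centers = k, nstart = 10)$tot.withinss
}

# Relative reduction when going from K - 1 to K clusters, K = 2, ..., 10
reduction <- 1 - ss[2:10] / ss[1:9]

plot(2:10, reduction, type = "b", xlab = "K", ylab = "1 - SS[K]/SS[K-1]")
```

A large value at some K says that adding the K-th cluster removed a large share of the remaining within-cluster variation.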

11 Silhouettes in R
The function silhouette.km finds the silhouettes for a given clustering, then sort.silhouette orders them, first by cluster number, then by value. To plot the silhouettes for K = 2, ..., 10:

    sil.ave <- NULL # To collect the silhouette means for each K
    par(mfrow=c(3,3))
    for(k in 2:10) {
      sil <- silhouette.km(sportsranks,kms[[k]]$centers)
      sil.ave <- c(sil.ave,mean(sil))
      ssil <- sort.silhouette(sil,kms[[k]]$cluster)
      plot(ssil,type="h",xlab="Observations",ylab="Silhouettes")
      title(paste("K =",k))
    }

The sil.ave calculated above can then be used to obtain the plot of averages:

    plot(2:10,sil.ave,type="l",xlab="K",ylab="Average silhouette width")
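The definitions of silhouette.km and sort.silhouette are not shown here. The sketch below is a base-R illustration under an assumed centroid-based definition: for each observation, a is the squared distance to its own cluster mean, b is the squared distance to the nearest other cluster mean, and the silhouette is (b - a)/max(a, b). The function centroid_silhouette is hypothetical and may differ from silhouette.km; x is a stand-in for sportsranks.

```r
# Hypothetical helper: centroid-based silhouettes (an assumed definition)
centroid_silhouette <- function(x, centers, cluster) {
  # n x K matrix of squared distances from each observation to each mean
  d2 <- sapply(seq_len(nrow(centers)),
               function(k) rowSums(sweep(x, 2, centers[k, ])^2))
  own <- cbind(seq_len(nrow(x)), cluster)
  a <- d2[own]             # squared distance to own cluster mean
  d2[own] <- Inf           # exclude own cluster from the minimum
  b <- apply(d2, 1, min)   # squared distance to nearest other mean
  (b - a) / pmax(a, b)
}

set.seed(1)
# Hypothetical stand-in for sportsranks: 130 random rank vectors
x <- t(replicate(130, sample(7)))

sil_ave <- numeric(0)
for (k in 2:5) {
  km <- kmeans(x, centers = k, nstart = 10)
  sil <- centroid_silhouette(x, km$centers, km$cluster)
  sil_ave <- c(sil_ave, mean(sil))
}
round(sil_ave, 3)   # average silhouette for K = 2, ..., 5
```

Values near 1 mean an observation sits much closer to its own mean than to any other; values near 0 mean it lies between two clusters.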

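12 Plotting the silhouettes
[Figure: sorted silhouette plots, one panel per value of K starting at K = 2, each annotated with its average silhouette width (Ave = 0.534 in one panel).]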

13 Plotting the silhouette averages
[Figure: average silhouette width plotted against K.]
K = 2 seems like a good choice.

14 Model-based clustering: Car data
The data consist of size measurements on 111 automobiles. The variables include length, wheelbase, width, height, front and rear head room, front leg room, rear seating, front and rear shoulder room, and luggage area. The data are in the file cars. The variables have been normalized to have medians of 0 and median absolute deviations (MAD) of 0.6745 (the MAD for a N(0, 1)).
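The normalization can be sketched as follows. The helper normalize_mad and the matrix raw are hypothetical and only illustrate the idea; the target MAD is computed as qnorm(0.75), the MAD of a standard normal, and constant = 1 turns off the default rescaling inside R's mad function.

```r
# Hypothetical helper: shift to median 0, scale to the MAD of a N(0, 1)
normalize_mad <- function(v) {
  target <- qnorm(0.75)   # MAD of a standard normal, about 0.6745
  (v - median(v)) / mad(v, constant = 1) * target
}

set.seed(1)
# Hypothetical raw measurements: 111 cars, 3 variables
raw <- matrix(rnorm(111 * 3, mean = 50, sd = 10), ncol = 3)

normed <- apply(raw, 2, normalize_mad)

apply(normed, 2, median)              # medians after normalization
apply(normed, 2, mad, constant = 1)   # MADs after normalization
```

With this scaling, a variable that is roughly normal ends up with a standard deviation close to 1, but the result is less sensitive to outliers than dividing by the standard deviation.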

15 R for model-based clustering
The R function we use is Mclust, in the package mclust. The basic command is simple:

    mcars <- Mclust(cars)

There are many options for plotting in the package. To see a plot of the BICs, use

    plot(mcars,cars,what="BIC")

You have to click on the graphics window, or hit enter, to reveal the plot. Note that the BICs in this function are actually the negatives of the usual BICs, so we want to maximize them.
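The sign convention can be written out explicitly. The notation below is introduced here for illustration: $\hat{L}$ is the maximized likelihood of a fitted model, $d$ its number of free parameters, and $n$ the number of observations.

```latex
\begin{align*}
\text{usual form (smaller is better):}\quad
  & \mathrm{BIC} = -2 \log \hat{L} + d \log n, \\
\text{form reported by mclust (larger is better):}\quad
  & \mathrm{BIC}_{\text{mclust}} = 2 \log \hat{L} - d \log n = -\mathrm{BIC}.
\end{align*}
```

Both forms rank the candidate models identically; only the direction of "best" is reversed.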

16 Plotting the BICs
[Figure: BIC plotted against the number of components, one curve for each model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV.]
K = 2 with VVV is best.

17 What is VVV?
To find the name of the best model:

    > mcars
    best model: ellipsoidal, unconstrained with 2 components

That K = 2 is easy to see. The assumptions on the covariance matrices are ellipsoidal, which means they have no special structure, and unconstrained, which means they are not assumed equal for the two groups, so $\Sigma_1 \neq \Sigma_2$ is allowed. To plot variable 1 (length) versus variable 4 (height), use

    plot(mcars,cars,what="classification",dimens=c(1,4))

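18 Plotting the clusters
[Figure: the clusters plotted against Length and Height, and against principal components (axis PC1), with variable labels FrtLegRoom, Width, Luggage, and RearHd.]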

19 The cars in group 2
Columns: Rear Head, Rear Seating, Rear Shoulder, Luggage [...]

Chevrolet Corvette
Honda Civic CRX
Mazda MX5 Miata
Mazda RX
Nissan 300ZX
Chevrolet Astro
Chevrolet Lumina APV
Dodge Caravan
Dodge Grand Caravan
Ford Aerostar
Mazda MPV
Mitsubishi Wagon
Nissan Axxess
Nissan Van
Volkswagen Vanagon

20 Just group 1
Redo on just the group 1 automobiles:

    cars1 <- cars[mcars$classification==1,]
    mcars1 <- Mclust(cars1)
    mcars1
    best model: elliposidal multivariate normal with 1 components

The best is one big cluster.

21 The models in mclust

    Code  Description                                           $\Sigma_k$
    EII   spherical, equal volume                               $\sigma^2 I_p$
    VII   spherical, unequal volume                             $\sigma_k^2 I_p$
    EEI   diagonal, equal volume and shape                      $\Lambda$
    VEI   diagonal, varying volume, equal shape                 $c_k \Delta$
    EVI   diagonal, equal volume, varying shape                 $c \Delta_k$
    VVI   diagonal, varying volume and shape                    $\Lambda_k$
    EEE   ellipsoidal, equal volume, shape, and orientation     $\Sigma$
    EEV   ellipsoidal, equal volume and equal shape             $\Gamma_k \Lambda \Gamma_k'$
    VEV   ellipsoidal, equal shape                              $c_k \Gamma_k \Delta \Gamma_k'$
    VVV   ellipsoidal, varying volume, shape, and orientation   arbitrary

Here, the $\Lambda$'s are diagonal matrices with positive diagonals, the $\Delta$'s are diagonal matrices with positive diagonals whose product is 1, the $\Gamma$'s are orthogonal matrices, the $\Sigma$'s are arbitrary nonnegative definite symmetric matrices, and the $c$'s are positive scalars. A subscript k on an element means the groups can have different values for that element. No subscript means that element is the same for each group.
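To see what volume, shape, and orientation refer to, any positive definite covariance matrix can be split into a positive scalar, a diagonal matrix whose diagonal entries multiply to 1, and an orthogonal matrix. The sketch below does this with an eigendecomposition; Sigma is a hypothetical example matrix, and the names vol, Delta and Gamma are chosen here for the three pieces.

```r
set.seed(1)
p <- 3
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A) + diag(p)    # hypothetical positive definite matrix

eg <- eigen(Sigma, symmetric = TRUE)
Gamma <- eg$vectors                # orientation: orthogonal matrix
vol <- prod(eg$values)^(1 / p)     # volume: det(Sigma)^(1/p)
Delta <- diag(eg$values / vol)     # shape: diagonal, product of diagonals is 1

# Putting the three pieces back together recovers Sigma
rebuilt <- vol * Gamma %*% Delta %*% t(Gamma)

prod(diag(Delta))
max(abs(rebuilt - Sigma))
```

The model codes then say which of the three pieces are shared across groups (E) and which may vary (V).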

22 Hierarchical clustering of the sports

    plclust(hclust(dist(t(sportsranks))))

[Figure: complete-linkage dendrogram of the seven sports, with Height on the vertical axis and leaves labeled Baseball, Football, Basketball, Jogging, Tennis, Cycling, Swimming.]

23 Hierarchical clustering of the individuals

    par(mfrow=c(2,1))
    dxs <- dist(sportsranks) # Gets Euclidean distances
    lbl <- rep("",130) # Prefer no labels for the individuals
    plclust(hclust(dxs),xlab="Complete linkage",sub="",labels=lbl)
    plclust(hclust(dxs,method="single"),xlab="Single linkage",sub="",labels=lbl)

[Figure: two dendrograms of the 130 individuals, each with Height on the vertical axis: complete linkage (top) and single linkage (bottom).]
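In recent versions of R the dendrogram is drawn with the plot method for hclust objects rather than plclust. The sketch below uses that route on a hypothetical stand-in for sportsranks (random rank vectors), and adds cutree, which is not used in the slides, to turn a tree into group labels.

```r
set.seed(1)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")
# Hypothetical stand-in for sportsranks: 130 random rank vectors
x <- t(replicate(130, sample(7)))
colnames(x) <- sports

# Clustering the sports: transpose so that each sport is a row
hc_sports <- hclust(dist(t(x)))    # complete linkage is the default

# Clustering the individuals, with two linkage methods
dxs <- dist(x)                     # Euclidean distances
hc_complete <- hclust(dxs)
hc_single <- hclust(dxs, method = "single")

par(mfrow = c(2, 1))
plot(hc_complete, labels = FALSE, xlab = "Complete linkage", sub = "")
plot(hc_single, labels = FALSE, xlab = "Single linkage", sub = "")

# Cut the complete-linkage tree into two groups
groups <- cutree(hc_complete, k = 2)
table(groups)
```

Single linkage joins clusters by their closest pair of points and tends to produce long chains, while complete linkage uses the farthest pair and tends to produce more compact groups.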
