3.1 Multivariate Analysis

High-dimensional data: $X_1, \dots, X_N$ i.i.d. random vectors in $\mathbb{R}^p$. As a data matrix $X$ (rows = objects, columns = values of the $p$ features):

$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Np} \end{pmatrix}, \qquad X_j^T = (X_{j1}, \dots, X_{jp}) = j\text{-th row of } X.$$

Many statistical problems and procedures (estimators, tests, ...) are similar to the one-dimensional case, but some are specific to multivariate data, e.g. dimension reduction, finding a few relevant features, ...
Ideally: $X_1, \dots, X_N$ i.i.d. multivariate normal $N_p(\mu, \Sigma)$ with mean vector $\mu$ and covariance matrix $\Sigma$:

$$\mu_k = E X_{jk}, \qquad \Sigma_{kl} = \operatorname{cov}(X_{jk}, X_{jl}), \quad k, l = 1, \dots, p.$$

Parameter estimates:

$$\hat\mu = \bar X_N = \frac{1}{N} \sum_{j=1}^N X_j, \qquad \hat\Sigma = S = \frac{1}{N} \sum_{j=1}^N (X_j - \bar X_N)(X_j - \bar X_N)^T,$$

i.e. $S_{kl} = \frac{1}{N} \sum_{j=1}^N (X_{jk} - \hat\mu_k)(X_{jl} - \hat\mu_l)$.

$S$ is symmetric, all eigenvalues $\ge 0$. Let $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$ be the eigenvalues of $S$:

$$S = O D O^T, \qquad D = \operatorname{diag}(d_1, \dots, d_p).$$
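As a minimal numerical illustration (not part of the original slides; the data and variable names are made up), the estimates $\hat\mu$, $S$ and the eigendecomposition $S = ODO^T$ can be computed with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 4
X = rng.normal(size=(N, p))   # synthetic data matrix, rows = objects

mu_hat = X.mean(axis=0)       # sample mean, hat mu = X_bar_N
Xc = X - mu_hat               # centered data
S = (Xc.T @ Xc) / N           # hat Sigma = S with the 1/N normalization used above

# Eigendecomposition S = O D O^T; eigh returns eigenvalues in ascending
# order, so reverse to get d_1 >= d_2 >= ... >= d_p >= 0.
d, O = np.linalg.eigh(S)
d, O = d[::-1], O[:, ::-1]
assert np.allclose(O @ np.diag(d) @ O.T, S)
```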
In $S = O D O^T$, $O$ is an orthogonal matrix (basis transformation); the columns of $O$ form an orthonormal basis of eigenvectors of $S$.

Analogously: $\delta_1 \ge \delta_2 \ge \dots \ge \delta_p \ge 0$ eigenvalues of $\Sigma$:

$$\Delta = \operatorname{diag}(\delta_1, \dots, \delta_p) = \Omega^T \Sigma \Omega, \qquad \Sigma = \Omega \Delta \Omega^T.$$

Principal Component Analysis (PCA)

$X_0$ a representative of $X_1, \dots, X_N$. Principal component transformation:

$$X_0 \mapsto W = \Omega^T (X_0 - \mu).$$

$k$-th principal component of $X_0$: $W_k = \langle X_0 - \mu, e_k \rangle$, where $e_1, \dots, e_p$ = normed eigenvectors of $\Sigma$ = columns of $\Omega$.
$X_0 - \mu = \sum_{k=1}^p W_k e_k$ is the representation of $X_0 - \mu$ in the eigenbasis of $\Sigma$.

Properties:

$$E W_k = 0, \qquad \operatorname{var} W_k = \delta_k, \qquad \operatorname{cov}(W_k, W_l) = 0, \quad k \ne l,$$

so $\operatorname{var} W_1 \ge \operatorname{var} W_2 \ge \dots \ge \operatorname{var} W_p$.

If $X_0$ is $N_p(\mu, \Sigma)$-distributed, then $W$ is $N_p(0, \Delta)$-distributed, and, hence, $W_1, \dots, W_p$ are independent!

$$\operatorname{var} W_1 = \max\Big\{\operatorname{var} U;\ U = \sum_{k=1}^p \alpha_k X_{0k},\ \alpha_1, \dots, \alpha_p \in \mathbb{R},\ \sum_{k=1}^p \alpha_k^2 = 1\Big\}$$

Idea of principal component analysis (dimension reduction): find $q \ll p$ linear combinations of $X_{01}, \dots, X_{0p}$ which explain a large percentage of the variability in the features.
Solution: $W_1, \dots, W_q$.

Open questions: $\mu, \Sigma = ?$, $q = ?$

Empirical principal components

Sample $X_1, \dots, X_N$ i.i.d., sample covariance matrix $S = O D O^T$, sample mean $\hat\mu = \bar X_N$. Principal component transformation:

$$X_j \mapsto V_j = O^T (X_j - \bar X_N)$$

Use the features $V_{j1}, \dots, V_{jq}$ instead of $X_{j1}, \dots, X_{jp}$, $j = 1, \dots, N$.

Selection of $q$: let $d_1 \ge d_2 \ge \dots \ge d_p$ be the eigenvalues of $S$. The proportion of the variability of $X_0$ explained by $W_1, \dots, W_q$ is

$$\frac{\delta_1 + \dots + \delta_q}{\delta_1 + \dots + \delta_p}, \quad \text{estimated by} \quad \frac{d_1 + \dots + d_q}{d_1 + \dots + d_p}.$$
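Continuing the NumPy sketch from above (again an illustration, not part of the slides), the empirical principal component transformation and the estimated explained proportions take one line each:

```python
V = Xc @ O                          # rows are V_j = O^T (X_j - X_bar_N)
explained = np.cumsum(d) / d.sum()  # (d_1 + ... + d_q) / (d_1 + ... + d_p), q = 1, ..., p
```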
Scree graph: plot $d_k$ (or $d_k / \sum_{l=1}^p d_l$) against $k$; choose that $q$ where the graph becomes flat.

Rules of thumb:

a) Choose $q$ such that 90% of the total variability is explained:

$$q = \min\Big\{r \le p;\ d_1 + \dots + d_r \ge 0.9 \sum_{k=1}^p d_k = 0.9 \operatorname{tr} S\Big\}$$

b) (Kaiser) Consider only principal components with above-average variance:

$$q = \max\Big\{r \le p;\ d_r \ge \frac{1}{p} \sum_{k=1}^p d_k\Big\}$$
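Both rules of thumb translate directly into code; a sketch continuing the running example (my variable names):

```python
# a) smallest q such that at least 90% of the total variability (0.9 * tr S) is explained
q_a = int(np.searchsorted(explained, 0.90) + 1)

# b) Kaiser's rule: keep only components with above-average variance d_r >= tr(S) / p
q_b = int(np.sum(d >= d.mean()))
```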
Example: $N = 180$ pit props cut from Corsican pine (Jeffers, 1967).

Goal (regression): $Y_j$ = maximum compressive strength as a function of $p = 13$ predictor variables:

1: top diameter
2: length
3: moisture content (% of dry weight)
4: specific gravity at the time of the test
5: oven-dry specific gravity of the timber
6: no. of annual rings at the top
7: no. of annual rings at the base
8: maximum bow
9: distance from the top to the point of maximum bow
10: no. of knot whorls
11: length of clear prop from the top
12: average no. of knots per whorl
13: average diameter of the knots

Principal component analysis for $X_{j1}, \dots, X_{jp}$, $j = 1, \dots, N$.
[Figure: scree graph $(k, d_k)$ for the Corsican pitprop data]
Eigenvector w.r.t. $d_1$:

$$e_1 = (0.40, 0.41, 0.12, 0.17, 0.06, 0.28, 0.40, 0.29, 0.36, 0.38, 0.01, 0.12, 0.11)^T$$

Interpretation of the first empirical principal components:

$V_{j1} \approx$ average of $X_{jk}$, $k = 1, 2, 6$-$10$ (total size of the pit prop)
$V_{j2} \approx$ average of $X_{jk}$, $k = 3, 4$ (degree of seasoning)
$V_{j3} \approx$ average of $X_{jk}$, $k = 4$-$7$ (speed of growth)
$V_{j4} \approx X_{j11}$ (length of clear prop from top)
$V_{j5} \approx X_{j12}$ (average no. of knots per whorl)
$V_{j6} \approx$ average of $X_{jk}$, $k = 5, 13$

Rules of thumb: a) $q = 6$; b) $q = 4$. Scree graph: $q = 3$ or $q = 6$.
Discriminant Analysis

Classification problem: an object comes from one of the classes $C_1, \dots, C_m$. Observed: feature vector $X_0$.

Assumption: $X_0$ has density $f_k(x)$ if the object is from $C_k$.

Bayes classifier: decide for class $C_k$ if

$$X_0 \in \{x;\ f_k(x) = \max_{i=1,\dots,m} f_i(x)\} \qquad (1)$$

Gaussian case: if the object is from $C_k$, $X_0$ is $N_p(\mu_k, \Sigma)$-distributed. Then, (1) is equivalent to

$$\|X_0 - \mu_k\|_\Sigma^2 = \min_{i=1,\dots,m} \|X_0 - \mu_i\|_\Sigma^2, \qquad \text{where } \|X_0 - \mu_k\|_\Sigma^2 = (X_0 - \mu_k)^T \Sigma^{-1} (X_0 - \mu_k)$$

is the squared Mahalanobis distance w.r.t. $\Sigma$.
Special case $m = 2$: decide for $C_1$ if

$$\alpha^T (X_0 - \mu) > 0 \quad \text{with} \quad \mu = \tfrac{1}{2}(\mu_1 + \mu_2), \quad \alpha = \Sigma^{-1}(\mu_1 - \mu_2).$$

In practice, $\mu_1, \dots, \mu_m, \Sigma$ are unknown. Training set with known classification: $X_j^{(k)}$, $j = 1, \dots, n_k$, i.i.d. $N_p(\mu_k, \Sigma)$, $k = 1, \dots, m$. Sample means and sample covariance matrices for each subsample:

$$\hat\mu_k = \frac{1}{n_k} \sum_{j=1}^{n_k} X_j^{(k)}, \qquad S^{(k)}, \quad k = 1, \dots, m.$$

Combine $S^{(1)}, \dots, S^{(m)}$ to an estimate of $\Sigma$:

$$S = \frac{1}{N} \sum_{k=1}^m n_k S^{(k)}, \qquad N = n_1 + \dots + n_m.$$
Empirical classification rule: decide for $C_k$ if

$$\|X_0 - \hat\mu_k\|_S^2 = \min_{i=1,\dots,m} \|X_0 - \hat\mu_i\|_S^2.$$

Warning: if $\mu_1 = \dots = \mu_m$, classification is meaningless, but $\hat\mu_1, \dots, \hat\mu_m$ are not equal. Safeguard: test $H_0: \mu_1 = \dots = \mu_m$ (multivariate ANOVA).
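A minimal sketch of this empirical rule (my own implementation, not from the slides), using the pooled covariance estimate $S$ defined above:

```python
import numpy as np

def fit_lda(samples):
    """samples: list of (n_k, p) arrays, one per class (the training set).
    Returns the sample means hat mu_k and the inverse of the pooled S."""
    N = sum(len(Xk) for Xk in samples)
    mus = [Xk.mean(axis=0) for Xk in samples]
    # S = (1/N) sum_k n_k S^(k), with S^(k) = (1/n_k) sum_j (X_j - mu_k)(X_j - mu_k)^T
    S = sum((Xk - mu).T @ (Xk - mu) for Xk, mu in zip(samples, mus)) / N
    return np.array(mus), np.linalg.inv(S)

def classify(x0, mus, S_inv):
    """Decide for the class minimizing the Mahalanobis distance ||x0 - mu_k||_S^2."""
    d2 = [(x0 - mu) @ S_inv @ (x0 - mu) for mu in mus]
    return int(np.argmin(d2))
```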
Fisher's discriminant rule

No Gaussian assumption; consider only linear discriminant functions, i.e. decide for $C_k$ if

$$|a^T (X_0 - \hat\mu_k)| < |a^T (X_0 - \hat\mu_i)| \quad \text{for all } i \ne k.$$

Choose $a$ such that the discriminant function shows maximal differences between the groups (of a training set): $a$ is an eigenvector w.r.t. the largest eigenvalue of $S^{-1} B$, where

$$B = \frac{1}{N} \sum_{k=1}^m n_k (\hat\mu_k - \bar X_N)(\hat\mu_k - \bar X_N)^T, \qquad \bar X_N = \frac{1}{N} \sum_{k=1}^m n_k \hat\mu_k = \frac{1}{N} \sum_{k=1}^m \sum_{j=1}^{n_k} X_j^{(k)}.$$

For $m = 2$, Fisher's rule coincides with the Bayes classification for Gaussian data. For $m > 2$, it usually does not.
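A sketch of computing Fisher's direction $a$ (again my code, continuing the example above; since $S^{-1}B$ is not symmetric, a general eigensolver is used):

```python
def fisher_direction(samples):
    """Return a = eigenvector of S^{-1} B for the largest eigenvalue."""
    N = sum(len(Xk) for Xk in samples)
    mus = [Xk.mean(axis=0) for Xk in samples]
    x_bar = sum(len(Xk) * mu for Xk, mu in zip(samples, mus)) / N
    S = sum((Xk - mu).T @ (Xk - mu) for Xk, mu in zip(samples, mus)) / N
    B = sum(len(Xk) * np.outer(mu - x_bar, mu - x_bar)
            for Xk, mu in zip(samples, mus)) / N
    evals, evecs = np.linalg.eig(np.linalg.solve(S, B))  # eigen-decompose S^{-1} B
    return np.real(evecs[:, np.argmax(np.real(evals))])
```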
Cluster analysis

Discriminant analysis: the classes $C_1, \dots, C_m$ are given; find a classification rule and estimate it based on a training set (supervised learning).

Cluster analysis: find appropriate classes! No information about which object belongs to which group (unsupervised learning).

Observed feature vectors $X_1, \dots, X_N \in \mathbb{R}^p$, independent. Assume, e.g., that $X_j$ is $N_p(\mu_k, \Sigma_k)$-distributed if $X_j$ belongs to class $C_k$, for some unknown $\mu_k, \Sigma_k$, $k = 1, \dots, m$, and unknown $m, C_1, \dots, C_m$.
Maximum likelihood over $\mu_k, \Sigma_k, C_1, \dots, C_m$ and $m$ is in principle possible, with a penalty for large $m$ to avoid overfitting, but it is usually computationally not feasible.

Hierarchical clustering algorithms

Needed: a distance between objects $i, j$. Common choice: the Pearson distance of the feature vectors,

$$d_{ij}^2 = \sum_{k=1}^p \frac{(X_{ik} - X_{jk})^2}{s_k^2}, \qquad s_k^2 = \frac{1}{N-1} \sum_{j=1}^N (X_{jk} - \hat\mu_k)^2, \quad \hat\mu_k = \frac{1}{N} \sum_{j=1}^N X_{jk}.$$

Agglomerative: start with the maximal number of clusters, i.e. each object is its own cluster: $C_j^{(0)} = \{j\}$, $j = 1, \dots, N$.
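Before turning to the linkage rules, a small sketch (my code) of the Pearson distance just defined: it is simply the Euclidean distance after standardizing each feature by its sample standard deviation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pearson_distances(X):
    """N x N matrix of d_ij with d_ij^2 = sum_k (X_ik - X_jk)^2 / s_k^2."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # ddof=1: 1/(N-1) variance s_k^2
    return squareform(pdist(Z))
```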
Nearest neighbour single linkage (nnsl)

Sort the $d_{ij}$, $i < j$: $d_{r_1 s_1} \le d_{r_2 s_2} \le \dots$

1) $N - 1$ clusters: $C_{r_1}^{(0)} \cup C_{s_1}^{(0)}$ and $C_j^{(0)}$, $j \ne r_1, s_1$.

2) If $r_1, s_1, r_2, s_2$ are all distinct: $N - 2$ clusters $\{r_1, s_1\}, \{r_2, s_2\}, \{j\}$, $j \ne r_1, s_1, r_2, s_2$.
If $r_1 = r_2$, $s_1 \ne s_2$: $N - 2$ clusters $\{r_1, s_1, s_2\}, \{j\}$, $j \ne r_1, s_1, s_2$.
If $r_1 \ne r_2$, $s_1 = s_2$: analogously.
...

l) Join the cluster containing $r_l$ with the cluster containing $s_l$.

Stop if $d_{r_l s_l} >$ some threshold $d_0$.
Average linkage

$D_{rs}$ = distance between cluster $r$ and cluster $s$.

0) $D_{ij}^{(0)} = d_{ij}$ for the initial clusters $C_j^{(0)} = \{j\}$.

1) As in nnsl; the distance of the new cluster $C_{r_1}^{(0)} \cup C_{s_1}^{(0)}$ to a cluster $C_j^{(0)}$, $j \ne r_1, s_1$, is

$$D_{1j}^{(1)} = \frac{1}{2} (d_{r_1 j} + d_{s_1 j}).$$

2) In step $i$, join the two clusters with minimal distance $D_{rs}^{(i-1)}$. Define the distance of the new cluster to the other clusters by

$$D_{\cdot k}^{(i)} = \frac{1}{2} \big(D_{rk}^{(i-1)} + D_{sk}^{(i-1)}\big), \qquad k \ne r, s.$$

3) Stop if all cluster distances are $> d_0$.
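In practice one rarely codes the merge loop by hand. A sketch using SciPy (my code, not from the slides): `method="single"` gives nnsl, and `method="weighted"` matches the $\frac{1}{2}(D_r + D_s)$ update above (SciPy's `"average"` instead weights by cluster sizes):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def hierarchical_clusters(X, d0, method="single"):
    """Agglomerative clustering on standardized features;
    stop merging once all cluster distances exceed d0."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # Pearson distance = Euclidean on Z
    merges = linkage(pdist(Z), method=method)         # full merge history
    return fcluster(merges, t=d0, criterion="distance")  # one cluster label per object
```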
Data: protein consumption in $N = 25$ European countries for $p = 9$ food groups:

1. Red meat
2. White meat
3. Eggs
4. Milk
5. Fish
6. Cereals
7. Starchy foods
8. Pulses, nuts, and oil-seeds
9. Fruits and vegetables
Complete linkage cluster analysis:

Eastern Europe (blue): East Germany, Czechoslovakia, Poland, USSR, Hungary;
Scandinavia (green): Sweden, Denmark, Norway, Finland;
Western Europe (red): UK, France, West Germany, Belgium, Ireland, Netherlands, Austria, Switzerland;
Iberia (purple): Spain, Portugal;
Mediterranean (orange): Italy, Greece;
the Balkans (yellow): Yugoslavia, Romania, Bulgaria, Albania.

PCA: dimension reduction to $q = 4$ principal components:
1: total meat consumption
2-4: consumption of red meat, white meat, and fish, respectively

Weber, A. (1973) Agrarpolitik im Spannungsfeld der internationalen Ernährungspolitik, Institut für Agrarpolitik und Marktlehre, Kiel.

Data for download: http://lib.stat.cmu.edu/dasl/stories/proteinconsumptionineurope.html