Multivariate Analysis


3.1 Multivariate Analysis

High-dimensional data: X_1, ..., X_N i.i.d. random vectors in R^p. As a data matrix X (rows = objects, columns = values of the p features):

        X_11  X_12  ...  X_1p
        X_21  X_22  ...  X_2p
  X =    .     .          .
        X_N1  X_N2  ...  X_Np

X_j^T = (X_j1, ..., X_jp) = j-th row of X.

Many statistical problems and procedures (estimators, tests, ...) are similar to dimension 1, but some are specific to multivariate data, e.g. dimension reduction, finding a few relevant features, ...
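To fix this convention in code, here is a minimal NumPy sketch (language choice and simulated data are my own illustration, not part of the lecture) of an N x p data matrix whose j-th row is the feature vector X_j^T:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix: N objects as rows, p features as columns,
# filled with simulated i.i.d. entries only to fix the convention.
N, p = 180, 13
X = rng.normal(size=(N, p))

x1 = X[0, :]                 # X_1^T = feature vector of the first object
print(X.shape, x1.shape)     # (180, 13) (13,)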

3.2 Ideally: X_1, ..., X_N i.i.d. multivariate normal N_p(µ, Σ) with mean vector µ and covariance matrix Σ,

  µ_k = E X_jk,   Σ_kl = cov(X_jk, X_jl),   k, l = 1, ..., p.

Parameter estimates:

  ˆµ = X̄_N = (1/N) sum_{j=1}^N X_j
  ˆΣ = S = (1/N) sum_{j=1}^N (X_j - X̄_N)(X_j - X̄_N)^T,
  i.e. S_kl = (1/N) sum_{j=1}^N (X_jk - ˆµ_k)(X_jl - ˆµ_l).

S is symmetric, all eigenvalues >= 0. Let d_1 >= d_2 >= ... >= d_p >= 0 be the eigenvalues of S:

  S = O D O^T,   D = diag(d_1, ..., d_p)
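A short NumPy sketch of these estimates and of the decomposition S = O D O^T, using simulated data; note that the slide's 1/N normalization is used, whereas np.cov would default to 1/(N-1):

import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 5
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # correlated toy data

mu_hat = X.mean(axis=0)                      # sample mean vector
Xc = X - mu_hat
S = (Xc.T @ Xc) / N                          # sample covariance, 1/N normalization

d, O = np.linalg.eigh(S)                     # eigh, since S is symmetric
order = np.argsort(d)[::-1]                  # sort d_1 >= d_2 >= ... >= d_p
d, O = d[order], O[:, order]

assert np.allclose(O @ np.diag(d) @ O.T, S)  # S = O D O^T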

3.3 S = O D O^T,  D = diag(d_1, ..., d_p),
O orthogonal matrix (basis transformation), columns of O = orthonormal basis of eigenvectors of S.

Analogously: δ_1 >= δ_2 >= ... >= δ_p >= 0 eigenvalues of Σ,

  Δ = diag(δ_1, ..., δ_p) = Ω^T Σ Ω,   Σ = Ω Δ Ω^T.

Principal Component Analysis (PCA)

X_0 representative of X_1, ..., X_N.
Principal component transformation: X_0 -> W = Ω^T (X_0 - µ).
k-th principal component of X_0: W_k = <X_0 - µ, e_k>,
where e_1, ..., e_p = normed eigenvectors of Σ = columns of Ω.

3.4 W = sum_{k=1}^p W_k e_k : representation of X_0 - µ in the eigenbasis of Σ.

Properties: E W_k = 0,  var W_k = δ_k,  cov(W_k, W_l) = 0 for k ≠ l, and

  var W_1 >= var W_2 >= ... >= var W_p.

If X_0 is N_p(µ, Σ)-distributed, then W is N_p(0, Δ)-distributed, and, hence, W_1, ..., W_p are independent!

  var W_1 = max{ var U ;  U = sum_{k=1}^p α_k X_0k,  α_1, ..., α_p ∈ R,  sum_{k=1}^p α_k^2 = 1 }

Idea of principal component analysis (dimension reduction): find q << p linear combinations of X_01, ..., X_0p which explain a large percentage of the variability in the features.
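A small numerical check of the maximality of var W_1, as a sketch only: simulated Gaussian data with a made-up covariance matrix, where the first empirical eigenvector plays the role of e_1 and is compared against random unit-norm linear combinations:

import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 1.5, 0.5],      # illustrative covariance matrix
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=5000)

S = np.cov(X, rowvar=False)
d, O = np.linalg.eigh(S)
e1 = O[:, np.argmax(d)]                         # first (empirical) eigenvector

var_pc1 = np.var(X @ e1)                        # variance of the first principal component
alphas = rng.normal(size=(1000, 3))             # random unit-norm coefficient vectors
alphas /= np.linalg.norm(alphas, axis=1, keepdims=True)
var_other = np.var(X @ alphas.T, axis=0)        # variances of other linear combinations

print(var_pc1 >= var_other.max())               # True: e_1 attains the maximal variance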

3.5 Solution: W_1, ..., W_q.   Open questions: µ, Σ = ?,  q = ?

Empirical principal components

Sample X_1, ..., X_N i.i.d., sample covariance matrix S = O D O^T, sample mean ˆµ = X̄_N.
Principal component transformation: X_j -> V_j = O^T (X_j - X̄_N).
Use the features V_j1, ..., V_jq instead of X_j1, ..., X_jp, j = 1, ..., N.

Selection of q: let d_1 >= d_2 >= ... >= d_p be the eigenvalues of S. The proportion of the variability of X_0 explained by W_1, ..., W_q is

  (δ_1 + ... + δ_q) / (δ_1 + ... + δ_p),   estimated by   (d_1 + ... + d_q) / (d_1 + ... + d_p).
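A sketch of the empirical principal component transformation and of the estimated explained proportion (simulated data; function and variable names are my own):

import numpy as np

def pca_scores(X, q):
    """Empirical principal components V_j1, ..., V_jq and the estimated
    proportion of variability (d_1 + ... + d_q) / (d_1 + ... + d_p)."""
    N, p = X.shape
    Xc = X - X.mean(axis=0)
    S = (Xc.T @ Xc) / N
    d, O = np.linalg.eigh(S)
    order = np.argsort(d)[::-1]
    d, O = d[order], O[:, order]
    V = Xc @ O                      # row j holds V_j = O^T (X_j - X_bar_N)
    return V[:, :q], d[:q].sum() / d.sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))   # toy data
V_q, prop = pca_scores(X, q=2)
print(V_q.shape, round(prop, 3))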

3.6 Scree graph: plot d_k or d_k / sum_{l=1}^p d_l against k; choose the q where the graph becomes flat.

Rules of thumb:

a) Choose q such that 90% of the total variability is explained:

   q = min{ r <= p ;  d_1 + ... + d_r >= 0.9 sum_{k=1}^p d_k = 0.9 tr S }

b) (Kaiser) Consider only principal components with above-average variance:

   q = max{ r <= p ;  d_r >= (1/p) sum_{k=1}^p d_k }
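Both rules of thumb reduce to simple operations on the sorted eigenvalues; a hedged sketch with made-up eigenvalues (not the pitprop values):

import numpy as np

def choose_q(d, level=0.9):
    """Rules of thumb for q, given eigenvalues d_1 >= ... >= d_p of S."""
    d = np.sort(np.asarray(d, dtype=float))[::-1]
    cum = np.cumsum(d)
    q_a = int(np.argmax(cum >= level * d.sum())) + 1   # a) explain >= 90% of tr S
    q_b = int(np.sum(d >= d.mean()))                   # b) Kaiser: above-average variance
    return q_a, q_b

d = [4.2, 2.1, 1.3, 0.8, 0.4, 0.15, 0.05]   # illustrative eigenvalues
print(choose_q(d))                           # (q from rule a, q from rule b)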

3.7 N = 180 pit props cut from Corsican pine (Jeffers, 1967).

Goal (regression): Y_j = maximum compressive strength as a function of p = 13 predictor variables:

 1: top diameter
 2: length
 3: moisture content (% of dry weight)
 4: specific gravity at time of test
 5: oven-dry specific gravity of timber
 6: no. of annual rings at top
 7: no. of annual rings at base
 8: maximum bow
 9: distance from top to point of maximum bow
10: no. of knot whorls
11: length of clear prop from top
12: average no. of knots per whorl
13: average diameter of knots

Principal component analysis for X_j1, ..., X_jp, j = 1, ..., N.

3.8 [Figure] Scree graph of (k, d_k) for the Corsican pitprop data.

3.9 Eigenvector w.r.t. d_1:

  e_1 = (0.40, 0.41, 0.12, 0.17, 0.06, 0.28, 0.40, 0.29, 0.36, 0.38, 0.01, 0.12, 0.11)^T

Interpretation of the first empirical principal components:

  V_j1 ≈ average of X_jk, k = 1, 2, 6-10   (total size of the pit prop)
  V_j2 ≈ average of X_jk, k = 3, 4         (degree of seasoning)
  V_j3 ≈ average of X_jk, k = 4-7          (speed of growth)
  V_j4 ≈ X_j11                             (length of clear prop from top)
  V_j5 ≈ X_j12                             (average no. of knots per whorl)
  V_j6 ≈ average of X_jk, k = 5, 13

Rules of thumb: a) q = 6, b) q = 4. Scree graph: q = 3 or q = 6.

3.10 Discriminant Analysis

Classification problem: an object comes from one of the classes C_1, ..., C_m.
Observed: feature vector X_0.
Assumption: X_0 has density f_k(x) if the object comes from C_k.

Bayes classifier: decide for class C_k if

  X_0 ∈ { x ;  f_k(x) = max_{i=1,...,m} f_i(x) }        (1)

Gaussian case: if the object comes from C_k, X_0 is N_p(µ_k, Σ)-distributed. Then (1) is equivalent to

  ||X_0 - µ_k||_Σ^2 = (X_0 - µ_k)^T Σ^{-1} (X_0 - µ_k) = min_{i=1,...,m} ||X_0 - µ_i||_Σ^2,

where ||·||_Σ = Mahalanobis distance w.r.t. Σ.
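A sketch of the Gaussian Bayes classifier as a minimal-Mahalanobis-distance rule (equal class priors implicitly assumed, as on the slide; the means and covariance below are made up for illustration):

import numpy as np

def bayes_gaussian_classify(x0, mus, Sigma):
    """Decide for the class k that minimizes the Mahalanobis distance
    (x0 - mu_k)^T Sigma^{-1} (x0 - mu_k); common covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = [(x0 - mu) @ Sigma_inv @ (x0 - mu) for mu in mus]
    return int(np.argmin(d2))                # index of the chosen class C_k

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([0.0, 4.0])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(bayes_gaussian_classify(np.array([2.5, 1.2]), mus, Sigma))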

3.11 Special case m = 2: decide for C_1 if α^T (X_0 - µ) > 0 with

  µ = (1/2)(µ_1 + µ_2),   α = Σ^{-1} (µ_1 - µ_2).

In practice, µ_1, ..., µ_m, Σ are unknown. Training set X_j^(k), j = 1, ..., n_k, i.i.d. N_p(µ_k, Σ), k = 1, ..., m, with known classification. Sample means and sample covariance matrices for each subsample:

  ˆµ_k = (1/n_k) sum_{j=1}^{n_k} X_j^(k),   S^(k),   k = 1, ..., m.

Combine S^(1), ..., S^(m) to an estimate of Σ:

  S = (1/N) sum_{k=1}^m n_k S^(k),   N = n_1 + ... + n_m.
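A sketch of the training-set estimates: per-class sample means, per-class covariances with the 1/n_k normalization used above, and the pooled estimate S (simulated two-class training data; names are my own):

import numpy as np

def pooled_estimates(samples):
    """Sample means and pooled covariance S = (1/N) sum_k n_k S^(k)."""
    N = sum(len(Xk) for Xk in samples)
    mus, S = [], 0.0
    for Xk in samples:
        mu_k = Xk.mean(axis=0)
        Xc = Xk - mu_k
        S = S + Xc.T @ Xc          # = n_k S^(k) with the 1/n_k normalization
        mus.append(mu_k)
    return mus, S / N

rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=40)   # class 1 training data
X2 = rng.multivariate_normal([2.0, 1.0], np.eye(2), size=60)   # class 2 training data
mus, S = pooled_estimates([X1, X2])
print(mus[0], S.shape)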

3.12 Empirical classification rule: decide for C_k if

  ||X_0 - ˆµ_k||_S^2 = min_{i=1,...,m} ||X_0 - ˆµ_i||_S^2.

Warning: if µ_1 = ... = µ_m, classification is meaningless, but ˆµ_1, ..., ˆµ_m will not be equal. Safeguard: test H_0: µ_1 = ... = µ_m (multivariate ANOVA).

Fisher's discriminant rule

No Gaussian assumption; consider only linear discriminant functions, i.e. decide for C_k if

  |a^T (X_0 - ˆµ_k)| < |a^T (X_0 - ˆµ_i)|   for all i ≠ k.

3.13 Decide for C_k if |a^T (X_0 - ˆµ_k)| < |a^T (X_0 - ˆµ_i)| for all i ≠ k.

Choose a such that the discriminant function shows maximal differences between the groups (of a training set): a is an eigenvector w.r.t. the largest eigenvalue of S^{-1} B, where

  B = (1/N) sum_{k=1}^m n_k (ˆµ_k - X̄_N)(ˆµ_k - X̄_N)^T,
  X̄_N = (1/N) sum_{k=1}^m n_k ˆµ_k = (1/N) sum_{k=1}^m sum_{j=1}^{n_k} X_j^(k).

For m = 2, Fisher's rule coincides with the Bayes classification for Gaussian data. For m > 2, usually not.
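A sketch of Fisher's rule: compute the pooled within-groups matrix S and the between-groups matrix B from a training set, take a as the leading eigenvector of S^{-1} B, and classify by the smallest |a^T (X_0 - ˆµ_k)| (simulated data; function names are my own):

import numpy as np

def fisher_direction(samples):
    """Leading eigenvector a of S^{-1} B (within/between-groups covariances)."""
    N = sum(len(Xk) for Xk in samples)
    mus = [Xk.mean(axis=0) for Xk in samples]
    xbar = sum(len(Xk) * mu for Xk, mu in zip(samples, mus)) / N
    S = sum((Xk - mu).T @ (Xk - mu) for Xk, mu in zip(samples, mus)) / N
    B = sum(len(Xk) * np.outer(mu - xbar, mu - xbar)
            for Xk, mu in zip(samples, mus)) / N
    vals, vecs = np.linalg.eig(np.linalg.inv(S) @ B)   # not symmetric in general
    return np.real(vecs[:, np.argmax(np.real(vals))]), mus

def fisher_classify(x0, a, mus):
    """Decide for C_k minimizing |a^T (x0 - mu_hat_k)|."""
    return int(np.argmin([abs(a @ (x0 - mu)) for mu in mus]))

rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=50)
X2 = rng.multivariate_normal([3.0, 1.0], np.eye(2), size=50)
a, mus = fisher_direction([X1, X2])
print(fisher_classify(np.array([2.0, 0.5]), a, mus))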

3.18 Cluster analysis

Discriminant analysis: the classes C_1, ..., C_m are given. Find a classification rule and estimate it from a training set (supervised learning).

Cluster analysis: find appropriate classes! No information about which object belongs to which group (unsupervised learning).

Observed feature vectors X_1, ..., X_N ∈ R^p, independent. Assume, e.g., that X_j is N_p(µ_k, Σ_k)-distributed if X_j belongs to class C_k, for some unknown µ_k, Σ_k, k = 1, ..., m, and unknown m, C_1, ..., C_m.

3.19 Maximum likelihood over µ_k, Σ_k, C_1, ..., C_m and m is in principle possible (with a penalty for large m to avoid overfitting), but usually computationally not feasible.

Hierarchical clustering algorithms

Needed: a distance between objects i and j. Common choice: Pearson distance of the feature vectors,

  d_ij^2 = sum_{k=1}^p (X_ik - X_jk)^2 / s_k^2,
  s_k^2 = 1/(N-1) sum_{j=1}^N (X_jk - ˆµ_k)^2,   ˆµ_k = (1/N) sum_{j=1}^N X_jk.

Agglomerative: start with the maximal number of clusters, i.e. each object is its own cluster: C_j^(0) = {j}, j = 1, ..., N.
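A NumPy sketch of the Pearson distance matrix, dividing each squared feature difference by the sample variance s_k^2 with the 1/(N-1) normalization (simulated data with deliberately different feature scales):

import numpy as np

def pearson_distances(X):
    """N x N matrix of Pearson distances d_ij between the rows of X."""
    s2 = X.var(axis=0, ddof=1)                   # s_k^2, k = 1, ..., p
    Z = X / np.sqrt(s2)                          # scale each feature by s_k
    diff = Z[:, None, :] - Z[None, :, :]         # pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))     # d_ij

rng = np.random.default_rng(6)
X = rng.normal(size=(6, 3)) * np.array([1.0, 10.0, 0.1])   # very different scales
D = pearson_distances(X)
print(D.shape, np.allclose(D, D.T))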

3.20 Nearest neighbour single linkage (nnsl)

Sort the distances d_ij, i < j:  d_{r_1 s_1} <= d_{r_2 s_2} <= ...

1) N-1 clusters: C_{r_1}^(0) + C_{s_1}^(0), and C_j^(0), j ≠ r_1, s_1.
2) If {r_1, s_1} and {r_2, s_2} are disjoint: N-2 clusters {r_1, s_1}, {r_2, s_2}, {j}, j ≠ r_1, s_1, r_2, s_2.
   If r_1 = r_2, s_1 ≠ s_2: N-2 clusters {r_1, s_1, s_2}, {j}, j ≠ r_1, s_1, s_2.
   If r_1 ≠ r_2, s_1 = s_2: analogously.
   ...
l) Join the cluster containing r_l with the cluster containing s_l.

Stop if d_{r_l s_l} > threshold d_0.

3.21 Average linkage

D_rs = distance between cluster r and cluster s.

0) D_ij^(0) = d_ij for C_j^(0) = {j}.
1) As in nnsl; the distance of the new cluster C_{r_1}^(0) + C_{s_1}^(0) to the cluster C_j^(0), j ≠ r_1, s_1, is

     D_{1j}^(1) = (1/2)(d_{r_1 j} + d_{s_1 j}).

2) Join the two clusters with minimal distance D_{rs}^(i-1). Define the distance of the new cluster to the other clusters by

     D_k^(i) = (1/2)(D_{rk}^(i-1) + D_{sk}^(i-1)),   k ≠ r, s.

3) Stop if all cluster distances > d_0.
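Both agglomerative schemes are available in SciPy; a hedged sketch on my own simulated data, where method="single" corresponds to nearest neighbour single linkage and method="weighted" implements exactly the update D^(i) = (1/2)(D_rk^(i-1) + D_sk^(i-1)) above (SciPy's "average" would instead weight by cluster sizes), with fcluster cutting the tree at the threshold d_0:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),    # two well-separated
               rng.normal(3.0, 0.3, size=(10, 2))])   # toy groups

s2 = X.var(axis=0, ddof=1)                             # Pearson distances, as above
D = np.sqrt((((X[:, None, :] - X[None, :, :]) ** 2) / s2).sum(axis=-1))

d0 = 2.0                                               # illustrative threshold
for method in ("single", "weighted"):
    Z = linkage(squareform(D), method=method)          # expects condensed distances
    labels = fcluster(Z, t=d0, criterion="distance")
    print(method, np.unique(labels).size, "clusters")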

3.22 Data: protein consumption in N = 25 European countries for p = 9 food groups:

1. Red meat
2. White meat
3. Eggs
4. Milk
5. Fish
6. Cereals
7. Starchy foods
8. Pulses, nuts, and oil-seeds
9. Fruits and vegetables

3.23 Complete linkage cluster analysis:

Eastern Europe (blue): East Germany, Czechoslovakia, Poland, USSR, Hungary;
Scandinavia (green): Sweden, Denmark, Norway, Finland;
Western Europe (red): UK, France, West Germany, Belgium, Ireland, Netherlands, Austria, Switzerland;
Iberia (purple): Spain, Portugal;
Mediterranean (orange): Italy, Greece;
Balkans (yellow): Yugoslavia, Romania, Bulgaria, Albania.

PCA: dimension reduction to q = 4 principal components:
1: total meat consumption; 2-4: consumption of red meat, white meat, and fish, respectively.

Weber, A. (1973) Agrarpolitik im Spannungsfeld der internationalen Ernährungspolitik, Institut für Agrarpolitik und Marktlehre, Kiel.

