INTRODUCTION TO MULTIVARIATE ANALYSIS. Advanced Biomedical Statistics. Ricardo Gonzalo Sanz, ricardo.gonzalo@vhir.org, 13/07/2015
1. Introduction to Multivariate Analysis 2. Summary Statistics for Multivariate Data 3. Inference with Multivariate data 4. Principal Components analysis
1. Introduction to multivariate analysis Difficult with only one variable... In real life most phenomena are complex and can rarely be described with a single variable: socio-economic surveys, clinical studies, economic indices.
1. Introduction to multivariate analysis Some examples... A coronary heart study measured 7 variables: arterial tension, age, weight, body surface, years suffering hypertension (HT), pulse, stress. Nutritional study: data on 29 fast-food products: Price, weight, calories, protein, fat, saturated_fat, sodium, iron, calcium, Vitamin_A, Vitamin_C, food_type. Risk prediction models for prostate cancer: race, age, sex, genetics, body mass index, family history of cancer, history of tobacco use, use of aspirin and nonsteroidal anti-inflammatory drugs (NSAIDs), physical activity, use of hormone replacement therapy, reproductive factors, history of cancer screening, and dietary factors.
1. Introduction to multivariate analysis Some examples... Gene expression analysis with high-throughput techniques (microarrays, RNA-Seq, etc.)
1. Introduction to multivariate analysis Nowadays it is very common to hear about MVA. [Newspaper clipping: SARA H. ASENADOR, DIARIO EXPANSIÓN, 27/06/2015]
1. Introduction to multivariate analysis Difficult with only one variable... All these are examples of multidimensional data, which require multivariate statistical techniques. Multivariate data arise when researchers measure several variables on each unit in their sample. The majority of data sets collected by researchers in all disciplines are multivariate. In some cases it may make sense to isolate each variable and study it separately, but in general it does not: only the simultaneous study of the variables will uncover the patterns in the data.
1. Introduction to multivariate analysis How the data looks: [Diagram: two data matrices with Observations (n) as rows and Variables (K) as columns, contrasting univariate statistics (K < n) with multivariate statistics, where K may exceed n (K > n)]
1. Introduction to multivariate analysis How the data looks: Huswif dataset.

Couple  Hage  Hheight  Wage  Wheight  Hagefm
1       49    1809     43    1590     25
2       25    1841     28    1560     19
3       40    1659     30    1620     38
4       52    1779     57    1540     26
5       58    1616     52    1420     30
6       32    1695     27    1660     23
7       43    1730     52    1610     33
8       47    1740     43    1580     26
9       31    1685     23    1610     26
10      26    1735     25    1590     23

# The observations are 10 married couples
# Hage: the husband's age (in years).
# Hheight: the husband's height (in mm).
# Wage: the wife's age (in years).
# Wheight: the wife's height (in mm).
# Hagefm: husband's age (in years) at first marriage.
1. Introduction to multivariate analysis How the data looks: Huswif dataset.
1. Introduction to multivariate analysis How it can be studied: one approach is to group the techniques depending on whether: 1. The goal is to model the relation between one or more independent explanatory variables and one or more dependent variables (Multiple Regression, Factor Analysis, Discriminant Analysis, ...). 2. The goal is to model the relations within a group of variables where none of them has special relevance (Principal Components Analysis, Cluster Analysis, MDS, ...).
1. Introduction to Multivariate Analysis 2. Summary Statistics for Multivariate Data 3. Inference with Multivariate data 4. Principal Components analysis
2. Summary Statistics for Multivariate Data Numeric summaries: 1. Summaries for each of the variables separately Means Variances 2. Summarize the relationships between the variables Covariances Correlations Distances
2. Summary Statistics for Multivariate Data Mean: (huswif data set)
2. Summary Statistics for Multivariate Data Mean: (huswif dataset)

          mean      sd         n
Hage      40.3      11.411982  10
Hagefm    26.9       5.466057  10
Hheight   1728.9    68.607499  10
Wage      38.0      12.832251  10
Wheight   1578.0    64.601342  10

Variance: a measure of the spread of a variable's values.

> apply(huswif, 2, var)
      Hage    Hheight       Wage    Wheight     Hagefm
 130.23333 4706.98889  164.66667 4173.33333   29.87778
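The slides compute these summaries in R (e.g. with apply(huswif, 2, var)). As an illustrative sketch only, not the course code, the same column-wise means, standard deviations and sample variances can be reproduced in plain Python, with the values transcribed from the huswif table above:

```python
from statistics import mean, stdev, variance

# huswif data transcribed from the slide (10 couples, 5 variables)
huswif = {
    "Hage":    [49, 25, 40, 52, 58, 32, 43, 47, 31, 26],
    "Hheight": [1809, 1841, 1659, 1779, 1616, 1695, 1730, 1740, 1685, 1735],
    "Wage":    [43, 28, 30, 57, 52, 27, 52, 43, 23, 25],
    "Wheight": [1590, 1560, 1620, 1540, 1420, 1660, 1610, 1580, 1610, 1590],
    "Hagefm":  [25, 19, 38, 26, 30, 23, 33, 26, 26, 23],
}

# Column-wise summaries, analogous to apply(huswif, 2, mean) / apply(huswif, 2, var);
# statistics.variance/stdev use the sample (n-1) convention, the same as R.
for name, values in huswif.items():
    print(f"{name}: mean={mean(values):.1f} sd={stdev(values):.6f} var={variance(values):.5f}")
```

The printed values match the numSummary-style table and the apply() output above.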
2. Summary Statistics for Multivariate Data Covariance: a measure of how two variables change together in the dataset. The covariance matrix contains the variances on the diagonal and the covariances off the diagonal.

> var(huswif)
               Hage     Hheight        Wage    Wheight      Hagefm
Hage      130.23333  -192.18889   128.55556  -436.0000    28.03333
Hheight  -192.18889  4706.98889    25.88889   876.4444  -229.34444
Wage      128.55556    25.88889   164.66667  -456.6667    21.66667
Wheight  -436.00000   876.44444  -456.66667  4173.3333    -8.00000
Hagefm     28.03333  -229.34444    21.66667    -8.0000    29.87778
2. Summary Statistics for Multivariate Data Correlation: a measure of the strength and direction of the linear relationship between two variables. It tells us whether the two variables are related or independent. Values range from -1 to +1.

> cor(huswif)
               Hage     Hheight        Wage      Wheight       Hagefm
Hage      1.0000000  -0.2454684   0.8778634  -0.59140348   0.44940667
Hheight  -0.2454684   1.0000000   0.0294062   0.19774762  -0.61156482
Wage      0.8778634   0.0294062   1.0000000  -0.55087737   0.30889801
Wheight  -0.5914035   0.1977476  -0.5508774   1.00000000  -0.02265553
Hagefm    0.4494067  -0.6115648   0.3088980  -0.02265553   1.00000000
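Covariance and correlation are easy to compute by hand. A minimal Python sketch (values transcribed from the huswif table, not the course's R code) reproduces the Hage/Wage entries of the var() and cor() matrices above:

```python
from statistics import mean, stdev

# Husband's and wife's ages, transcribed from the huswif table
hage = [49, 25, 40, 52, 58, 32, 43, 47, 31, 26]
wage = [43, 28, 30, 57, 52, 27, 52, 43, 23, 25]

def cov(x, y):
    """Sample covariance (n - 1 denominator), as returned by R's var()/cov()."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def cor(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return cov(x, y) / (stdev(x) * stdev(y))

print(cov(hage, wage))  # ~128.56, the Hage/Wage entry of var(huswif)
print(cor(hage, wage))  # ~0.878, the Hage/Wage entry of cor(huswif)
```

Dividing the covariance by both standard deviations is what makes the correlation unit-free and bounded by -1 and +1.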
2. Summary Statistics for Multivariate Data Distances: the most common measure of distance is the Euclidean distance: d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
2. Summary Statistics for Multivariate Data Distances: > dist(scale(huswif)) 1 2 3 4 5 6 7 8 9 2 2.725315 3 3.507198 4.662010 4 1.443297 3.641078 3.865915 5 4.097448 5.600133 4.188873 3.171566 6 2.800491 2.800462 2.955585 3.713186 5.074846 7 2.081379 3.970204 2.222621 2.024355 3.666068 2.988463 8 1.048628 2.997644 2.828334 1.445258 3.370065 2.355273 1.578504 9 2.883217 2.799552 2.430380 3.668488 4.571880 1.013093 2.878059 2.291979 10 2.706807 1.789020 3.260408 3.566634 4.885015 1.347130 3.177160 2.384186 1.069791
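R's dist(scale(huswif)) first standardizes each column (subtract the mean, divide by the sample SD) and then takes Euclidean distances between rows. A plain-Python sketch (data transcribed from the huswif table) reproduces the first entry of the distance matrix above, the distance between couples 1 and 2:

```python
from math import sqrt
from statistics import mean, stdev

# huswif rows transcribed from the slide: Hage, Hheight, Wage, Wheight, Hagefm
rows = [
    [49, 1809, 43, 1590, 25],
    [25, 1841, 28, 1560, 19],
    [40, 1659, 30, 1620, 38],
    [52, 1779, 57, 1540, 26],
    [58, 1616, 52, 1420, 30],
    [32, 1695, 27, 1660, 23],
    [43, 1730, 52, 1610, 33],
    [47, 1740, 43, 1580, 26],
    [31, 1685, 23, 1610, 26],
    [26, 1735, 25, 1590, 23],
]

# Equivalent of R's scale(): center each column, divide by its sample SD
cols = list(zip(*rows))
means = [mean(c) for c in cols]
sds = [stdev(c) for c in cols]
scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two standardized rows."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance between couples 1 and 2, matching dist(scale(huswif))
print(euclidean(scaled[0], scaled[1]))  # ~2.7253
```

Standardizing first matters: without it, the height columns (measured in mm) would dominate the distances.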
2. Summary Statistics for Multivariate Data Distances: [Figure: cluster dendrogram of the 10 couples, built from dist(scale(huswif)) with hclust(*, "complete")]

> plot(hclust(dist(scale(huswif))))
2. Summary Statistics for Multivariate Data Graphical summaries: Scatterplot
2. Summary Statistics for Multivariate Data Graphical summaries: Boxplot

boxplot(scale(huswif), col="red")
2. Summary Statistics for Multivariate Data Graphical summaries: Star plot

stars(huswif, full=TRUE, scale=TRUE, labels=c(1:10), key.loc=c(8,2),
      main="Star plot of Huswif dataset", draw.segments=TRUE)

[Figure: star plot of the Huswif dataset] Each star represents one couple of the dataset; each ray in a star is proportional to one variable (Hage, Hheight, Wage, Wheight, Hagefm).
2. Summary Statistics for Multivariate Data Graphical summaries: Biplot [Figure: example biplot showing variables such as fuel, gear ratio and size]
2. Summary Statistics for Multivariate Data Exercise Description: Largemouth bass were studied in 53 different Florida lakes to examine the factors that influence the level of mercury contamination. Water samples were collected from the surface of the middle of each lake in August 1990 and then again in March 1991. The pH level and the amounts of chlorophyll, calcium, and alkalinity were measured in each sample. The averages of the August and March values were used in the analysis. Next, a sample of fish was taken from each lake, with sample sizes ranging from 4 to 44 fish. The age of each fish and the mercury concentration in its muscle tissue were measured. (Note: since fish absorb mercury over time, older fish will tend to have higher concentrations.) Thus, to make a fair comparison of the fish in different lakes, the investigators used a regression estimate of the expected mercury concentration in a three-year-old fish as the standardized value for each lake. Finally, in 10 of the 53 lakes, the age of the individual fish could not be determined and the average mercury concentration of the sampled fish was used instead of the standardized value. Dataset: Mercury.txt
2. Summary Statistics for Multivariate Data Exercise Variable Names: ID: ID number. Lake: name of the lake. Alkalinity: alkalinity (mg/l as calcium carbonate). ph: pH. Calcium: calcium (mg/l). Chlorophyll: chlorophyll (mg/l). Avg_Mercury: average mercury concentration (parts per million) in the muscle tissue of the fish sampled from that lake. No.samples: how many fish were sampled from the lake. min: minimum mercury concentration among the sampled fish. max: maximum mercury concentration among the sampled fish. 3_yr_Standard_mercury: regression estimate of the mercury concentration in a 3-year-old fish from the lake (or = Avg_Mercury when age data was not available). age_data: indicator of the availability of age data on the fish sampled.
2. Summary Statistics for Multivariate Data Exercise
2. Summary Statistics for Multivariate Data Exercise

             mean        sd          IQR    0%    25%    50%    75%    100%    n
age_data     0.8113208   0.3949977   0.00   0.00  1.00   1.00   1.00   1.00    53
Alkalinity   37.5301887  38.2035267  59.90  1.20  6.60   19.60  66.50  128.00  53
Avg_Mercury  0.5271698   0.3410356   0.50   0.04  0.27   0.48   0.77   1.33    53
Calcium      22.2018868  24.9325744  32.30  1.10  3.30   12.60  35.60  90.70   53
Chlorophyll  23.1169811  30.8163214  20.10  0.70  4.60   12.80  24.70  152.40  53
max          0.8745283   0.5220469   0.85   0.06  0.48   0.84   1.33   2.04    53
min          0.2798113   0.2264058   0.24   0.04  0.09   0.25   0.33   0.92    53
No.samples   13.0566038  8.5606773   2.00   4.00  10.00  12.00  12.00  44.00   53
ph           6.5905660   1.2884493   1.60   3.60  5.80   6.80   7.40   9.10    53

> apply(mercurio[,c(3:10)],2,var)
  Alkalinity           ph      Calcium  Chlorophyll  Avg_Mercury   No.samples          min          max
1.459509e+03   1.6602e+00 6.216333e+02 9.496457e+02 1.163053e-01  7.32520e+01 5.125958e-02  2.72539e-01
2. Summary Statistics for Multivariate Data

> var(mercurio[,c(3:12)])
                        Alkalinity          ph     Calcium Chlorophyll Avg_Mercury   No.samples          min
Alkalinity             1459.509456 35.39971335  793.065711  562.193324 -7.73773984   3.36556604 -4.544071118
ph                       35.399713  1.66010160   18.540018   24.159971 -0.25283491  -0.20522496 -0.158097968
Calcium                 793.065711 18.54001814  621.633266  314.949198 -3.40693687 -19.07703193 -1.876788099
Chlorophyll             562.193324 24.15997097  314.949198  949.645668 -5.16408563  -3.11828737 -2.793996734
Avg_Mercury              -7.737740 -0.25283491   -3.406937   -5.164086  0.11630530   0.23074020  0.071591763
No.samples                3.365566 -0.20522496  -19.077032   -3.118287  0.23074020  73.28519594 -0.158258345
min                      -4.544071 -0.15809797   -1.876788   -2.793997  0.07159176  -0.15825835  0.051259579
max                     -12.062062 -0.37116800   -5.309432   -7.802021  0.16305729   0.71993106  0.090460486
X3_yr_Standard_Mercury   -8.126195 -0.26746916   -3.922122   -5.286440  0.11080733   0.07481495  0.070485232
age_data                 -1.432656  0.01933962   -0.020791   -3.444811  0.01464804   0.70319303  0.009002177

                                 max X3_yr_Standard_Mercury     age_data
Alkalinity              -12.06206241            -8.12619485 -1.432656023
ph                       -0.37116800            -0.26746916  0.019339623
Calcium                  -5.30943179            -3.92212155 -0.020791001
Chlorophyll              -7.80202068            -5.28644013 -3.444811321
Avg_Mercury               0.16305729             0.11080733  0.014648041
No.samples                0.71993106             0.07481495  0.703193033
min                       0.09046049             0.07048523  0.009002177
max                       0.27253295             0.15203327  0.019332366
X3_yr_Standard_Mercury    0.15203327             0.11473759  0.011962990
age_data                  0.01933237             0.01196299  0.156023222
2. Summary Statistics for Multivariate Data

> cor(mercurio[,c(3:12)])
                        Alkalinity          ph      Calcium Chlorophyll Avg_Mercury  No.samples         min
Alkalinity              1.00000000  0.71916568  0.832604192  0.47753085 -0.59389671  0.01029074 -0.52535654
ph                      0.71916568  1.00000000  0.577132721  0.60848276 -0.57540012 -0.01860607 -0.54196524
Calcium                 0.83260419  0.57713272  1.000000000  0.40991385 -0.40067958 -0.08937901 -0.33247623
Chlorophyll             0.47753085  0.60848276  0.409913846  1.00000000 -0.49137481 -0.01182027 -0.40045856
Avg_Mercury            -0.59389671 -0.57540012 -0.400679584 -0.49137481  1.00000000  0.07903426  0.92720506
No.samples              0.01029074 -0.01860607 -0.089379013 -0.01182027  0.07903426  1.00000000 -0.08165278
min                    -0.52535654 -0.54196524 -0.332476229 -0.40045856  0.92720506 -0.08165278  1.00000000
max                    -0.60479558 -0.55181523 -0.407916635 -0.48497215  0.91586397  0.16109174  0.76535319
X3_yr_Standard_Mercury -0.62795845 -0.61284905 -0.464409465 -0.50644193  0.95921481  0.02580046  0.91908939
age_data               -0.09493882  0.03800021 -0.002111124 -0.28300234  0.10873896  0.20795617  0.10066197

                                max X3_yr_Standard_Mercury     age_data
Alkalinity              -0.60479558            -0.62795845 -0.094938825
ph                      -0.55181523            -0.61284905  0.038000214
Calcium                 -0.40791663            -0.46440947 -0.002111124
Chlorophyll             -0.48497215            -0.50644193 -0.283002338
Avg_Mercury              0.91586397             0.95921481  0.108738958
No.samples               0.16109174             0.02580046  0.207956171
min                      0.76535319             0.91908939  0.100661967
max                      1.00000000             0.85975810  0.093752072
X3_yr_Standard_Mercury   0.85975810             1.00000000  0.089411267
age_data                 0.09375207             0.08941127  1.000000000
2. Summary Statistics for Multivariate Data
2. Summary Statistics for Multivariate Data boxplot(scale(mercurio[,c(3:12)]))
2. Summary Statistics for Multivariate Data > dist(scale(mercurio[,c(3:12)])) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 2 2.9016321 3 9.5511161 9.6958499 4 5.9700052 5.7901963 5.4975211 5 1.7049716 2.9444491 9.4567670 5.5146204 6 6.5117538 7.3672340 5.2034732 3.1605559 6.0163661 7 4.8371073 5.6709130 6.9644420 3.0902148 4.1053618 2.5577068 8 7.2675708 8.1380443 4.3833318 3.6614673 6.9761643 2.5924963 3.9568669 9 4.3839465 5.0895482 7.6604933 3.5745672 3.3692068 3.6552183 2.5621512 4.6618097 10 3.3235193 4.3136751 7.3683176 3.6524397 2.7583312 3.5624804 2.3264386 4.7436432 1.8778757 11 3.5606840 4.5499254 7.3662122 3.6292460 2.8388717 3.4324057 1.5017797 4.6538165 2.2577812...
2. Summary Statistics for Multivariate Data > plot(hclust(dist(scale(mercurio[,c(3:12)]))),labels=mercurio[,2])
1. Introduction to Multivariate Analysis 2. Summary Statistics for Multivariate Data 3. Inference with Multivariate data 4. Principal Components analysis
3. Inference with Multivariate data Hotelling and MANOVA tests. Hotelling's T² test: the multivariate analogue of the familiar Student's t test from univariate analysis. It tests for differences between the (multivariate) means of different populations. MANOVA: the multivariate analogue of univariate ANOVA. The means of several variables are compared across different populations.
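As an illustration only (the data below are hypothetical, the formula is the standard one), the one-sample Hotelling statistic T² = n (x̄ - μ₀)ᵀ S⁻¹ (x̄ - μ₀) can be computed by hand for two variables, inverting the 2x2 sample covariance matrix directly:

```python
from statistics import mean

# Hypothetical bivariate sample: n = 4 observations on 2 variables
x = [(1, 2), (2, 3), (3, 5), (4, 4)]
mu0 = (0, 0)  # hypothesised mean vector under H0

n = len(x)
xbar = [mean(col) for col in zip(*x)]

# Sample covariance matrix S (2x2, n - 1 denominator)
def s(i, j):
    return sum((row[i] - xbar[i]) * (row[j] - xbar[j]) for row in x) / (n - 1)
S = [[s(0, 0), s(0, 1)], [s(1, 0), s(1, 1)]]

# Closed-form inverse of the 2x2 matrix S
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

# T^2 = n * (xbar - mu0)' S^-1 (xbar - mu0)
d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
T2 = n * (d[0] * (Sinv[0][0] * d[0] + Sinv[0][1] * d[1])
          + d[1] * (Sinv[1][0] * d[0] + Sinv[1][1] * d[1]))
print(T2)
```

In practice T² is referred to an F distribution (after rescaling) to obtain a p-value; in R one would typically use a package function rather than compute it by hand.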
1. Introduction to Multivariate Analysis 2. Summary Statistics for Multivariate Data 3. Inference with Multivariate data 4. Principal Components analysis
4. Principal Component Analysis Definition. PC1 = 37%: teams towards the top of the graph typically concede more shots and win more aerial duels, while teams further down attempt more short passes with greater accuracy. PC2 = 18%: teams further to the right of the graph attempt more tackles, interceptions and dribbles.
4. Principal Component Analysis Definition. Given a KxN data matrix containing K (correlated) measurements on N samples (objects/individuals), PCA decomposes the data matrix into K new components that: account for different sources of variability in the data; are uncorrelated, that is, each component accounts for a different source of variability; have decreasing explanatory ability (each component explains more than the following one); allow a lower-dimensional representation of the data in terms of scores on the principal components; provide an overview of the dominant patterns and major trends in the data (visualize clusters, identify outliers).
4. Principal Component Analysis How PCA works. We have a 30 x 28 data frame of absorbance values for 30 retention times and 28 wavelengths in an HPLC experiment. A principal component analysis consists of a repetitive process of using linear regression to find a new set of axes that are better aligned with the data. Each axis represents some unknown factor that explains a significant portion of the variation in the data. This is accomplished by fitting a straight line to the 30 data points; the resulting linear regression model gives a new axis that best explains the data. Next, the 30 data points are projected onto the 27-dimensional surface perpendicular to the regression line, and the process of regression and projection continues until there is a complete set of 28 new axes, each representing an unknown factor of lesser importance than those preceding it.
4. Principal Component Analysis How PCA works. Because the data are correlated, it is difficult to separate each source of variability. If K were much higher it would be even more difficult.
4. Principal Component Analysis How PCA works. Transform the data: center each variable by subtracting its mean; scale each variable by dividing by its SD. All variables are now comparable: mean = 0, SD = 1.
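This transformation is R's scale(). A minimal Python check, with hypothetical values, that centering and scaling really yield mean 0 and SD 1:

```python
from statistics import mean, stdev

# Hypothetical variable in arbitrary units
x = [105, 110, 114, 116, 125]

# Center (subtract the mean) and scale (divide by the sample SD)
z = [(v - mean(x)) / stdev(x) for v in x]
print(round(mean(z), 10), round(stdev(z), 10))  # 0.0 and 1.0
```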
4. Principal Component Analysis How PCA works. First principal component: a linear combination of all the original variables that goes along the direction of highest variability in the data, and therefore explains the maximum amount of variation in the data.
4. Principal Component Analysis How PCA works. 2nd principal component: a linear combination of all the original variables that goes along the next direction of highest variability in the data, orthogonal to the first PC, and explains the maximum amount of remaining variation in the data. Successive PCs describe decreasing amounts of remaining variation.
4. Principal Component Analysis How PCA works. PCA provides a new set of coordinates for the observations. Original coordinates: the values of the variables. New coordinates: the values of the PCs, called scores. Scores are the new coordinates in the orthogonal system defined by the PCs.
4. Principal Component Analysis How PCA works. The PCs have been derived so that: they are orthogonal; each PC explains the maximum amount of remaining variation in the data. This means it is not necessary to use all the PCs to visualize the data in the new coordinate system. Taking the first few PCs (usually only the first 2 or 3) will often explain a high percentage of the variability. This should always be checked!
4. Principal Component Analysis How PCA works. PCs can be interpreted by looking at which of the original variables contribute most to their variability. The more a variable is correlated with a PC, the higher its influence. The sizes of the contributions of each variable are the loadings. Loadings are the cosines of the angles between the variables and the PCs.
4. Principal Component Analysis How PCA works. Summary: PCA performs a transformation into a new set of orthogonal coordinates with decreasing ability (most, 1st PC, to least, last PC) to explain the observed variability. The analysis provides: the % of variance explained by each PC; the loadings (correlations between PCs and variables), used to (try to) interpret what the PCs mean; the scores (values of the observations in the PC system of coordinates), used to plot the observations in reduced dimension.
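The whole procedure can be sketched end to end for the simplest case of two variables, where the covariance matrix is 2x2 and its eigendecomposition has a closed form. This is an illustrative pure-Python sketch on hypothetical data, not the course's R code (which uses princomp):

```python
from math import sqrt
from statistics import mean

# Hypothetical 2-variable dataset with strong positive correlation
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]

# 1) Center the data
mx, my = mean(p[0] for p in pts), mean(p[1] for p in pts)
c = [(x - mx, y - my) for x, y in pts]

# 2) Sample covariance matrix [[a, b], [b, d]]
n = len(pts)
a = sum(x * x for x, _ in c) / (n - 1)
d = sum(y * y for _, y in c) / (n - 1)
b = sum(x * y for x, y in c) / (n - 1)

# 3) Eigenvalues of a symmetric 2x2 matrix (closed form)
disc = sqrt((a - d) ** 2 + 4 * b * b)
lam1, lam2 = (a + d + disc) / 2, (a + d - disc) / 2  # lam1 >= lam2

# 4) Loading vector of PC1: eigenvector for lam1, normalised
#    (this eigenvector formula assumes b != 0, true here)
vx, vy = b, lam1 - a
norm = sqrt(vx * vx + vy * vy)
loading1 = (vx / norm, vy / norm)

# Proportion of variance explained by PC1
explained = lam1 / (lam1 + lam2)
print(loading1, round(explained, 3))
```

Because the points lie almost on a line, PC1 explains nearly all the variance and both variables load on it roughly equally, which is exactly the pattern one reads off a loadings table.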
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt) Summary: mean sd IQR 0% 25% 50% 75% 100% n AnysHT 6.43 2.145276 2.350 2.5 5.250 6.00 7.60 10.2 20 Edat 48.60 2.500526 2.250 45.0 47.000 48.50 49.25 56.0 20 Estrés 53.35 37.086350 78.000 8.0 17.000 44.50 95.00 99.0 20 Pes 93.09 4.294905 4.625 85.4 90.225 94.15 94.85 101.3 20 Pols 69.60 3.803046 4.250 62.0 67.750 70.00 72.00 76.0 20 PressioArt 114.00 5.428967 6.250 105.0 110.000 114.00 116.25 125.0 20
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)

Component loadings:
               Comp.1      Comp.2       Comp.3      Comp.4      Comp.5      Comp.6      Comp.7
AnysHT     -0.2196463  -0.4326076  0.863805862  0.09396639 -0.09556582  0.00800588 -0.02046973
Edat       -0.3656831  -0.2504879 -0.153309218 -0.82970342 -0.17236274  0.02012800 -0.24800432
Estrés     -0.1795159  -0.6297590 -0.450148996  0.43722537 -0.38322337 -0.17186104 -0.03132291
Pes        -0.4471312   0.3324439  0.036139750  0.22336145  0.15349179 -0.51151099 -0.59426854
Pols       -0.4268349  -0.2345728 -0.162218859  0.13200081  0.73111795  0.42646029  0.05144427
PressioArt -0.4881391   0.1896851 -0.005473516 -0.06757512 -0.04692661 -0.37658925  0.75968534
SupCorp    -0.4067078   0.3898541  0.007113247  0.19928768 -0.50398692  0.62021223 -0.06457785

Loadings: correlations between PCs and variables. Use these to (try to) interpret what the PCs mean. They serve as a guide to quantify how important a given variable is in a component.
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)

Component variances:
     Comp.1      Comp.2      Comp.3      Comp.4      Comp.5      Comp.6      Comp.7
3.908291334 1.470208219 0.708792320 0.521698737 0.307956215 0.080815410 0.002237764

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5     Comp.6       Comp.7
Standard deviation     1.9769399 1.2125214 0.8418980 0.72228716 0.55493803 0.28428051 0.0473050057
Proportion of Variance 0.5583273 0.2100297 0.1012560 0.07452839 0.04399375 0.01154506 0.0003196805
Cumulative Proportion  0.5583273 0.7683571 0.8696131 0.94414152 0.98813526 0.99968032 1.0000000000

Each component, in order, explains more variability than the one that follows it. The analysis provides: the component variances; the percentage of variability explained by each component; the scree plot, to guide the decision on how many components should be retained in order to provide a good explanation of the data in reduced dimension.
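The proportions and cumulative proportions are simply the component variances divided by their total (which equals the number of variables, 7, when PCA is run on standardized data). A quick check in Python, using the variances reported above:

```python
# Component variances reported by the analysis on the slide
variances = [3.908291334, 1.470208219, 0.708792320, 0.521698737,
             0.307956215, 0.080815410, 0.002237764]

total = sum(variances)  # ~7 for 7 standardized variables
proportion = [v / total for v in variances]
cumulative = [sum(proportion[:i + 1]) for i in range(len(proportion))]

print([round(p, 4) for p in proportion])   # first entry ~0.5583
print([round(cp, 4) for cp in cumulative]) # second entry ~0.7684
```

So the first two components already account for about 77% of the variability, which supports keeping only 2 or 3 components.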
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt) PC values can be used to plot the data The plot can be used as a guide to interpret the main sources of variability In RCmdr if the option Add principal components to dataset has been selected the plot is done as usual selecting these new variables.
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)

.PC <- princomp(~AnysHT+Edat+Estrés+Pes+Pols+PressioArt+SupCorp, cor=TRUE, data=coronari)
text(.PC$scores[,1], .PC$scores[,2], coronari[,7])
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt)

biplot(.PC)
4. Principal Component Analysis Example: Coronary Risk data (RiscCoronari.txt) Compute the correlation matrix of the combined dataset (original variables plus PC scores). Look at the correlations between the PCs and the original variables.
4. Principal Component Analysis Exercise Data: Obreros.csv
4. Principal Component Analysis RESULTS
4. Principal Component Analysis [Figure: scree plot of .PC showing the variances of Comp.1 through Comp.7]
4. Principal Component Analysis Principal components interpretation (loadings) PC in the dataset:
4. Principal Component Analysis Correlation between PC and variables
4. Principal Component Analysis CONCLUSIONS. PC1 separates the families by number of children and economic variables. PC2 separates the CA families, and other families with few children, from the rest.
4. Principal Component Analysis plot(hclust(dist(scale(obreros[,c(2:7)]))),labels=obreros[,1])
4. Principal Component Analysis

biplot(.PC)
text(.PC$scores[,1], .PC$scores[,2], obreros[,1])