INTRODUCTION TO MULTIVARIATE ANALYSIS. Estadística Biomèdica Avançada (Advanced Biomedical Statistics). Ricardo Gonzalo Sanz, 13/07/2015

1. Introduction to Multivariate Analysis
2. Summary Statistics for Multivariate Data
3. Inference with Multivariate Data
4. Principal Component Analysis

1. Introduction to multivariate analysis
Difficult with only one variable...
In real life, most phenomena are complex and can rarely be described using a single variable:
- Socioeconomic surveys
- Clinical studies
- Economic indices

Some examples...
- A coronary heart study that measured 7 variables: arterial tension, age, weight, body surface, years suffering from HT, pulse, stress.
- A nutritional study with data on 29 fast-food products: price, weight, calories, protein, fat, saturated fat, sodium, iron, calcium, vitamin A, vitamin C, food type.
- Risk prediction models for prostate cancer: race, age, sex, genetics, body mass index, family history of cancer, history of tobacco use, use of aspirin and nonsteroidal anti-inflammatory drugs (NSAIDs), physical activity, use of hormone replacement therapy, reproductive factors, history of cancer screening, and dietary factors.

Some examples...
- Gene expression analysis with high-throughput techniques (microarrays, RNA-Seq, ...)

Nowadays it is very common to hear about MVA.
[Figure: newspaper clipping, SARA H. ASENADOR, DIARIO EXPANSIÓN, 27/06/2015]

Difficult with only one variable...
All these are examples of multidimensional data, which require multivariate statistical techniques.
Multivariate data arise when researchers measure several variables on each unit in their sample. The majority of data sets collected by researchers in all disciplines are multivariate.
In some cases it may make sense to isolate each variable and study it separately, but in the main it does not: only the simultaneous study of the variables will uncover the patterns in the data.

How the data look:
[Figure: data matrices of observations (n) by variables (K), contrasting univariate statistics with multivariate statistics; the multivariate case covers both K > n and K < n]

How the data look: the Huswif dataset.
Columns: Couple, Hage, Hheight, Wage, Wheight, Hagefm
# The observations are 10 married couples
# Hage: the husband's age (in years).
# Hheight: the husband's height (in mm).
# Wage: the wife's age (in years).
# Wheight: the wife's height (in mm).
# Hagefm: the husband's age (in years) at first marriage.
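As a rough sketch of this layout, the block below builds a data frame with the same shape and column names. The values are simulated placeholders (an assumption made purely for illustration, not the course data), so only the structure is meaningful.

```r
# Simulated stand-in for huswif: the numbers are random placeholders,
# NOT the course data. Only the shape and column names follow the slide.
set.seed(1)
vars <- c("Hage", "Hheight", "Wage", "Wheight", "Hagefm")
huswif <- as.data.frame(matrix(
  rnorm(50, mean = rep(c(45, 1750, 43, 1620, 25), each = 10),
            sd   = rep(c(10,   60, 10,   60,  4), each = 10)),
  nrow = 10, ncol = 5, dimnames = list(NULL, vars)))

dim(huswif)      # observations (n) in rows, variables (K) in columns
head(huswif, 3)  # first three couples
```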

How it can be studied:
One approach is to group the techniques according to the goal of the analysis:
1. The goal is to model the relation between one or more independent (explanatory) variables and one or more dependent variables: Multiple Regression, Factor Analysis, Discriminant Analysis, ...
2. The goal is to model the relations within a group of variables, none of which has special relevance: Principal Component Analysis, Cluster Analysis, MDS, ...

2. Summary Statistics for Multivariate Data
Numeric summaries:
1. Summaries of each variable separately: means, variances
2. Summaries of the relationships between the variables: covariances, correlations, distances

Mean (huswif data set):
[Table: mean, sd and n for Hage, Hagefm, Hheight, Wage and Wheight]
Variances: the variance is a measure of the spread of a variable's values.
> apply(huswif,2,var)
[Output: variances of Hage, Hheight, Wage, Wheight and Hagefm]
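A minimal sketch of these per-variable summaries, using a simulated stand-in for huswif (random placeholder values, not the course data, so the printed numbers will differ from the slides):

```r
# Simulated stand-in for huswif (placeholder values, NOT the course data).
set.seed(1)
vars <- c("Hage", "Hheight", "Wage", "Wheight", "Hagefm")
huswif <- as.data.frame(matrix(
  rnorm(50, mean = rep(c(45, 1750, 43, 1620, 25), each = 10),
            sd   = rep(c(10,   60, 10,   60,  4), each = 10)),
  nrow = 10, ncol = 5, dimnames = list(NULL, vars)))

colMeans(huswif)             # mean of each variable
s <- apply(huswif, 2, sd)    # standard deviation of each variable
v <- apply(huswif, 2, var)   # variance of each variable
all.equal(v, s^2)            # the variance is the squared standard deviation
```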

Covariances: the covariance is a measure of how two variables change together in the dataset. In the covariance matrix, the diagonal holds the variances and the off-diagonal entries hold the covariances.
> var(huswif)
[Output: 5 x 5 covariance matrix of Hage, Hheight, Wage, Wheight and Hagefm]
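The structure of the covariance matrix can be checked with a short sketch. The data are again a simulated stand-in for huswif (placeholder values only), and the hand computation uses the usual sample formula with divisor n - 1.

```r
# Simulated stand-in for huswif (placeholder values, NOT the course data).
set.seed(1)
vars <- c("Hage", "Hheight", "Wage", "Wheight", "Hagefm")
huswif <- as.data.frame(matrix(
  rnorm(50, mean = rep(c(45, 1750, 43, 1620, 25), each = 10),
            sd   = rep(c(10,   60, 10,   60,  4), each = 10)),
  nrow = 10, ncol = 5, dimnames = list(NULL, vars)))

S <- var(huswif)       # covariance matrix (same result as cov(huswif))
isSymmetric(S)         # cov(X, Y) equals cov(Y, X)
all.equal(diag(S), apply(huswif, 2, var))   # diagonal holds the variances

# One off-diagonal entry by hand: sample covariance of Hage and Wage
n <- nrow(huswif)
by_hand <- with(huswif, sum((Hage - mean(Hage)) * (Wage - mean(Wage))) / (n - 1))
all.equal(by_hand, S["Hage", "Wage"])
```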

Correlations: the correlation is a measure of the strength and direction of the linear relationship between two variables. It tells us whether two variables are linearly related; a value near 0 indicates no linear relationship, which is not the same as independence. Values go from -1 to +1.
> cor(huswif)
[Output: 5 x 5 correlation matrix of Hage, Hheight, Wage, Wheight and Hagefm]
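As a hedged illustration of how correlations relate to covariances, the sketch below rescales the covariance matrix and compares it with cor(). The data are a simulated stand-in for huswif (placeholder values, not the course data).

```r
# Simulated stand-in for huswif (placeholder values, NOT the course data).
set.seed(1)
vars <- c("Hage", "Hheight", "Wage", "Wheight", "Hagefm")
huswif <- as.data.frame(matrix(
  rnorm(50, mean = rep(c(45, 1750, 43, 1620, 25), each = 10),
            sd   = rep(c(10,   60, 10,   60,  4), each = 10)),
  nrow = 10, ncol = 5, dimnames = list(NULL, vars)))

R <- cor(huswif)
all.equal(R, cov2cor(var(huswif)))  # correlation = covariance rescaled by the SDs
range(R)                            # all values lie between -1 and +1
diag(R)                             # each variable correlates perfectly with itself
```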

Distances: the most common measure of distance is the Euclidean distance:
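For two observations $i$ and $j$ measured on $K$ variables, with $x_{ik}$ denoting the value of variable $k$ for observation $i$ (notation assumed here), the standard definition is:

```latex
d_{ij} = \sqrt{\sum_{k=1}^{K} \left( x_{ik} - x_{jk} \right)^{2}}
```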

Distances (computed on the standardized variables):
> dist(scale(huswif))
[Output: matrix of pairwise distances between the 10 couples]
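A small sketch of what dist(scale(...)) computes: standardize every column, then take the Euclidean distance between rows. The data are a simulated stand-in for huswif (placeholder values, not the course data).

```r
# Simulated stand-in for huswif (placeholder values, NOT the course data).
set.seed(1)
vars <- c("Hage", "Hheight", "Wage", "Wheight", "Hagefm")
huswif <- as.data.frame(matrix(
  rnorm(50, mean = rep(c(45, 1750, 43, 1620, 25), each = 10),
            sd   = rep(c(10,   60, 10,   60,  4), each = 10)),
  nrow = 10, ncol = 5, dimnames = list(NULL, vars)))

Z <- scale(huswif)   # each column now has mean 0 and SD 1
D <- dist(Z)         # Euclidean distances between all pairs of couples

# Distance between couples 1 and 2, computed by hand
d12 <- sqrt(sum((Z[1, ] - Z[2, ])^2))
all.equal(as.matrix(D)[1, 2], d12)
```

Scaling first keeps the heights (in mm) from dominating the ages (in years) simply because of their units.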

Distances:
> plot(hclust(dist(scale(huswif))))
[Figure: Cluster Dendrogram; y axis: Height; x axis: dist(scale(huswif)), hclust (*, "complete")]
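The dendrogram can also be cut into groups. The sketch below uses a simulated stand-in for huswif (placeholder values, not the course data), and the choice of two clusters is arbitrary, for illustration only.

```r
# Simulated stand-in for huswif (placeholder values, NOT the course data).
set.seed(1)
vars <- c("Hage", "Hheight", "Wage", "Wheight", "Hagefm")
huswif <- as.data.frame(matrix(
  rnorm(50, mean = rep(c(45, 1750, 43, 1620, 25), each = 10),
            sd   = rep(c(10,   60, 10,   60,  4), each = 10)),
  nrow = 10, ncol = 5, dimnames = list(NULL, vars)))

hc <- hclust(dist(scale(huswif)))  # default agglomeration method: "complete"
plot(hc)                           # cluster dendrogram
groups <- cutree(hc, k = 2)        # k = 2 is an arbitrary choice for the example
table(groups)                      # number of couples in each cluster
```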

Graphical summaries: Scatterplot

Graphical summaries: Boxplot
boxplot(scale(huswif),col="red")

Graphical summaries: Star plot
stars(huswif,full=TRUE,scale=TRUE,labels=c(1:10),key.loc=c(8,2),main="Star plot of Huswif dataset",draw.segments=TRUE)
Each star represents one couple in the dataset; each ray of the star is proportional to one variable (Hheight, Hage, Wage, Wheight, Hagefm).

Graphical summaries: Biplot
[Figure: biplot; visible labels: Fuel, gear ratio, size]

Exercise
Description: Largemouth bass were studied in 53 different Florida lakes to examine the factors that influence the level of mercury contamination. Water samples were collected from the surface of the middle of each lake in August 1990 and then again in March. The pH level and the amounts of chlorophyll, calcium, and alkalinity were measured in each sample. The average of the August and March values was used in the analysis.
Next, a sample of fish was taken from each lake, with sample sizes ranging from 4 to 44 fish. The age of each fish and the mercury concentration in its muscle tissue were measured. (Note: since fish absorb mercury over time, older fish will tend to have higher concentrations.) Thus, to make a fair comparison of the fish in different lakes, the investigators used a regression estimate of the expected mercury concentration in a three-year-old fish as the standardized value for each lake. Finally, in 10 of the 53 lakes the age of the individual fish could not be determined, and the average mercury concentration of the sampled fish was used instead of the standardized value.
Dataset: Mercury.txt

Exercise
Variable names:
- ID: ID number
- Lake: name of the lake
- Alkalinity: alkalinity (mg/l as calcium carbonate)
- ph: pH
- Calcium: calcium (mg/l)
- Chlorophyll: chlorophyll (mg/l)
- Avg_Mercury: average mercury concentration (parts per million) in the muscle tissue of the fish sampled from that lake
- No.samples: how many fish were sampled from the lake
- min: minimum mercury concentration among the sampled fish
- max: maximum mercury concentration among the sampled fish
- 3_yr_Standard_mercury: regression estimate of the mercury concentration in a 3-year-old fish from the lake (or = Avg_Mercury when age data was not available)
- age_data: indicator of the availability of age data on the fish sampled

Exercise: numerical summaries
[Table: mean, sd, IQR, quantiles (0%, 25%, 50%, 75%, 100%) and n for age_data, Alkalinity, Avg_Mercury, Calcium, Chlorophyll, max, min, No.samples and ph]
> apply(mercurio[,c(3:10)],2,var)
[Output: variances of Alkalinity, ph, Calcium, Chlorophyll, Avg_Mercury, No.samples, min and max]

> var(mercurio[,c(3:12)])
[Output: covariance matrix of Alkalinity, ph, Calcium, Chlorophyll, Avg_Mercury, No.samples, min, max, X3_yr_Standard_Mercury and age_data]

> cor(mercurio[,c(3:12)])
[Output: correlation matrix of the same ten variables]

> boxplot(scale(mercurio[,c(3:12)]))

> dist(scale(mercurio[,c(3:12)]))

> plot(hclust(dist(scale(mercurio[,c(3:12)]))),labels=mercurio[,2])

3. Inference with Multivariate data
Hotelling and MANOVA tests.
- Hotelling's T² test: the multivariate analogue of the familiar Student's t test from univariate analysis. It tests for differences between the (multivariate) means of two populations.
- MANOVA: the multivariate analogue of the univariate ANOVA. The means of several variables are compared across different populations.
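Both tests can be sketched in base R. The example below is an illustration under stated assumptions, not part of the course material: it uses R's built-in iris data as a stand-in, computes the two-sample Hotelling T² statistic by hand with a pooled covariance matrix (which assumes equal covariance matrices and multivariate normality in both groups), and then fits a MANOVA with manova().

```r
# Stand-in data: R's built-in iris (NOT a dataset from these slides).
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
X1 <- as.matrix(iris[iris$Species == "setosa", vars])
X2 <- as.matrix(iris[iris$Species == "versicolor", vars])
n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)

# Two-sample Hotelling T^2 with pooled covariance matrix
d  <- colMeans(X1) - colMeans(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
T2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(Sp) %*% d)

# Under the null hypothesis the scaled statistic follows an F distribution
Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
pval  <- pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)
c(T2 = T2, F = Fstat, p.value = pval)

# MANOVA: compare the mean vectors of all three species at once
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)
summary(fit, test = "Wilks")
```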

4. Principal Component Analysis
Definition.
[Figure: PCA plot of teams]
- PC1 = 37%: teams towards the top of the graph typically concede more shots and win more aerial duels, while as you move down, teams attempt more short passes with greater accuracy.
- PC2 = 18%: teams further to the right of the graph attempt more tackles, interceptions and dribbles.

Definition.
Given a K x N data matrix containing K (correlated) measurements on N samples (objects/individuals), PCA decomposes the data matrix into K new components that:
- account for different sources of variability in the data;
- are uncorrelated, that is, each component accounts for a different source of variability;
- have decreasing explanatory ability: each component explains more than the following one;
- allow for a lower-dimensional representation of the data in terms of scores on the principal components;
- provide an overview of the dominant patterns and major trends in the data (visualize clusters, identify outliers).
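Two of these properties (uncorrelated components, decreasing variance) can be checked numerically. This is a minimal sketch on R's built-in USArrests data, used here only as a stand-in for a set of correlated measurements; it is not one of the course datasets.

```r
# Stand-in data: R's built-in USArrests (50 states, 4 correlated variables),
# NOT a dataset from these slides.
pc <- princomp(USArrests, cor = TRUE)

comp_var <- pc$sdev^2         # variance of each component
comp_var                      # listed from largest to smallest

score_cor <- cor(pc$scores)   # correlations between component scores
round(score_cor, 10)          # off-diagonal entries are 0 up to rounding
```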

How does PCA work?
We have a data frame of absorbance values for 30 retention times and 28 wavelengths in an HPLC.
A principal component analysis consists of a repetitive process of using linear regression to find a new set of axes that are better aligned with the data. Each new axis represents some unknown factor that has the power to explain a significant portion of the variation in the data. The first axis is found by fitting a straight line to the 30 data points, with the resulting linear regression model giving a new axis that best explains the data. Next, the 30 data points are projected onto the 27-dimensional surface that is perpendicular to the regression line, and the process of regression and projection continues until there is a complete set of 28 new axes, each representing an unknown factor of lesser importance than those preceding it.

How does PCA work?
Because the data are correlated, it is difficult to separate each source of variability. If K were much higher, it would be even more difficult.

How does PCA work?
Transform the data:
- Center each variable by subtracting its mean.
- Scale each variable by dividing by its SD.
All variables are now comparable: mean = 0, SD = 1.
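In R this transformation is a single call to scale(). A minimal sketch on the built-in USArrests data (a stand-in, not a course dataset):

```r
# Stand-in data: R's built-in USArrests (NOT a dataset from these slides).
Z <- scale(USArrests)      # subtract column means, divide by column SDs

round(colMeans(Z), 10)     # every mean is 0 up to rounding
apply(Z, 2, sd)            # every SD is 1
```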

How does PCA work?
First principal component: a linear combination of all the original variables that
- goes along the direction of highest variability in the data;
- explains the maximum amount of variation in the data.

How does PCA work?
Second principal component: a linear combination of all the original variables that
- goes along the next direction of highest variability in the data, orthogonal to the first PC;
- explains the maximum amount of the remaining variation in the data.
Successive PCs describe decreasing amounts of the remaining variation.
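One standard way to obtain these directions, shown here as a sketch rather than as the method used in the course, is the eigendecomposition of the correlation matrix: the eigenvectors give the PC directions and the eigenvalues give the variance each PC explains. The built-in USArrests data serve as a stand-in.

```r
# Stand-in data: R's built-in USArrests (NOT a dataset from these slides).
e  <- eigen(cor(USArrests))             # eigenvalues come sorted, largest first
pc <- princomp(USArrests, cor = TRUE)

# Eigenvalues equal the component variances reported by princomp
all.equal(e$values, unname(pc$sdev^2))

# Eigenvectors equal the princomp loadings, up to the sign of each column
max(abs(abs(e$vectors) - abs(unclass(pc$loadings))))

# The directions are orthogonal and of unit length
round(crossprod(e$vectors), 10)
```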

How does PCA work?
PCA provides a new set of coordinates for the observations:
- Original coordinates: the values of the variables (X1, X2 in the figure).
- New coordinates: the values of the PCs (scores).
Scores are the new coordinates in the orthogonal system defined by the PCs.
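The change of coordinates can be reproduced by hand: standardize the data with the centers and scales stored by princomp, then multiply by the loadings matrix. A sketch on the built-in USArrests data (a stand-in, not a course dataset):

```r
# Stand-in data: R's built-in USArrests (NOT a dataset from these slides).
pc <- princomp(USArrests, cor = TRUE)

# Standardize with the same centers and scales princomp used internally
Z <- scale(USArrests, center = pc$center, scale = pc$scale)

# New coordinates = standardized data projected onto the PC directions
scores_by_hand <- Z %*% unclass(pc$loadings)
all.equal(unname(scores_by_hand), unname(pc$scores))
```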

How does PCA work?
PCs have been derived so that:
- they are orthogonal;
- each PC explains the maximum amount of the remaining variation in the data.
This means that it is not necessary to use all PCs to visualize the data in this new coordinate system. The first few PCs (usually only the first 2 or 3) will often explain a high percentage of the variability. This should always be checked!
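One way to run that check, sketched on the built-in USArrests data (a stand-in, not a course dataset), is to compute the proportion of variance explained by each PC and its running total:

```r
# Stand-in data: R's built-in USArrests (NOT a dataset from these slides).
pc <- princomp(USArrests, cor = TRUE)

prop <- pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per component
cumsum(prop)                         # cumulative proportion

summary(pc)      # the same information as a formatted table
screeplot(pc)    # component variances, to help choose how many PCs to keep
```

With standardized variables the component variances add up to the number of variables, so each proportion is simply the component variance divided by K.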

How does PCA work?
PCs can be interpreted by looking at which of the original variables contribute most to their variability.
- The more a variable is correlated with a PC, the greater its influence.
- The size of the contribution of each variable is given by the loadings.
- Loadings are the cosines of the angles between the variables and the PCs.

How does PCA work?
Summary:
PCA performs a transformation into a new set of orthogonal coordinates with decreasing ability (most for the 1st PC, least for the last PC) to explain the observed variability.
A PCA provides:
- The % of variance explained by each PC.
- Loadings: correlations between PCs and variables. Use these to (try to) interpret what the PCs mean.
- Scores: values of the observations in the PC system of coordinates. Use these to plot the observations in reduced dimension.
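A note on conventions, with a sketch: in R, the `loadings` component returned by princomp() holds unit-length eigenvectors, while the variable-PC correlations described above are those vectors multiplied by each component's standard deviation (for a PCA on the correlation matrix). The example uses the built-in USArrests data as a stand-in, not a course dataset.

```r
# Stand-in data: R's built-in USArrests (NOT a dataset from these slides).
pc <- princomp(USArrests, cor = TRUE)

L <- unclass(pc$loadings)          # R's "loadings": unit-length eigenvectors
colSums(L^2)                       # each column has squared length 1

r <- cor(USArrests, pc$scores)     # correlations between variables and PCs
r_from_L <- sweep(L, 2, pc$sdev, `*`)   # each loading column times its PC's SD
all.equal(r, r_from_L, check.attributes = FALSE)

round(r[, 1:2], 2)                 # variables vs. the first two PCs
```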

Example: Coronary Risk data (RiscCoronari.txt)

Summary:
[Table: mean, sd, IQR, quantiles (0%, 25%, 50%, 75%, 100%) and n for AnysHT, Edat, Estrés, Pes, Pols and PressioArt]

Component loadings:
[Table: loadings of AnysHT, Edat, Estrés, Pes, Pols, PressioArt and SupCorp on Comp.1 to Comp.7]
Loadings: correlations between PCs and variables.
- Use these to (try to) interpret what the PCs mean.
- They serve as a guide to quantify how important a given variable is in a component.

Component variances:
[Output: variances of Comp.1 to Comp.7]
Importance of components:
[Table: standard deviation, proportion of variance and cumulative proportion for Comp.1 to Comp.7]
Each component, in order, explains more variability than the one that follows it.
The analysis provides:
- Component variances.
- The percentage of variability explained by each component.
- The scree plot, to guide the decision on how many components should be retained in order to provide a good explanation of the data in reduced dimension.

PC values can be used to plot the data. The plot can be used as a guide to interpret the main sources of variability.
In Rcmdr, if the option "Add principal components to dataset" has been selected, the plot is done as usual by selecting these new variables.

.PC <- princomp(~AnysHT+Edat+Estrés+Pes+Pols+PressioArt+SupCorp, cor=TRUE, data=coronari)
# text() needs an open plot: draw empty axes for the first two PCs, then add the labels
plot(.PC$scores[,1],.PC$scores[,2],type="n")
text(.PC$scores[,1],.PC$scores[,2],coronari[,7])

biplot(.PC)

Compute the correlation matrix of the combined dataset (original variables plus PCs) and look at the correlations between the PCs and the original variables.

Exercise
Data: Obreros.csv

RESULTS

[Figure: scree plot of .PC; y axis: Variances; x axis: Comp.1 to Comp.7]

Principal components interpretation (loadings)
PCs in the dataset:

Correlation between PCs and variables

CONCLUSIONS.
- PC1 separates the families by number of children and by economic factors.
- PC2 separates the CA families and other families with a low number of children from the rest.

plot(hclust(dist(scale(obreros[,c(2:7)]))),labels=obreros[,1])

biplot(.PC)
text(.PC$scores[,1],.PC$scores[,2],obreros[,1])
