Machine Learning - TP


Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr
IUT STID (Carcassonne) & SAMM (Université Paris 1)
Formation INRA, Niveau 3

Packages in R

R is provided with basic functions, but more than 3,000 packages offering additional functions are available on the CRAN (Comprehensive R Archive Network); see also the Bioconductor project.

Installing new packages (has to be done only once):
with the command line:
install.packages(c("nnet", "e1071", "rpart", "car", "randomForest"))
or with the menu (Windows or Mac OS X).

Loading a package (has to be done each time R is restarted):
library(nnet)
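As a small aside (not part of the original slides), the installation step can be left in a script and re-run safely if only the missing packages are installed; a minimal sketch:

# Sketch: install only the packages that are not already present
pkgs <- c("nnet", "e1071", "rpart", "car", "randomForest")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
# then load all of them at once
invisible(lapply(pkgs, library, character.only = TRUE))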

Working directory

All files, organized in proper directories/subdirectories, can be downloaded either as individual files or as a full zip file, inra-package.zip.

The companion script ML-scriptR.R is meant to be run from the subdirectory ML/TP of the provided material. For the computers in the classroom, this can be set by the following command line:
setwd("/home/fp/Bureau/inra-package/ML/TP")

If you are using Windows or Mac OS X, you can also choose to do it from the menu (Fichier / Définir le répertoire de travail). In any case, you must adapt the working directory if you want to run the script on another computer!

Outline
1 Introduction: Data importation and exploration
2 Neural networks
3 CART
4 Random forest

Introduction: Data importation and exploration

Use case description

Data kindly provided by Laurence Liaubet and described in [Liaubet et al., 2011]:
microarray data: expression of 272 selected genes over 57 individuals (pigs);
a phenotype of interest (muscle pH) measured over the same 57 individuals (numerical variable).

File 1: gene expressions; file 2: muscle pH.

Load data

# Loading gene expressions
d <- read.table("../../Data/sel_data.csv", sep = ";", header = T, row.names = 1, dec = ",")
dim(d)
names(d)
summary(d)

# Loading pH
ph <- scan("../../Data/sel_ph.csv", dec = ",")
length(ph)
summary(ph)

Basic analysis

# Data distribution
layout(matrix(c(1, 1, 2), ncol = 3))
boxplot(d, main = "genes expressions")
boxplot(ph, main = "ph")

Correlation analysis

# Correlation analysis
heatmap(as.matrix(d))

Correlation with pH

# Correlation with pH
ph.cor <- function(x) { cor(x, ph) }
totalcor <- apply(d, 1, ph.cor)
hist(totalcor, main = "Correlations between genes expression\n and ph", xlab = "Correlations")
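A small follow-up (an addition, not in the original script): once totalcor is computed, the genes most correlated (in absolute value) with pH can be listed directly, which is often more useful than the histogram alone:

# Sketch: the 10 genes whose expression is most correlated with pH
top <- order(abs(totalcor), decreasing = TRUE)[1:10]
data.frame(gene = names(totalcor)[top], correlation = totalcor[top])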

Classification and regression tasks

1 Regression: predict pH (a numerical variable) from gene expressions. Useful to:
help assess the strength of the relation between gene expressions and pH;
help understand the nature of the relation.

2 Classification: predict whether the pH is smaller or greater than 5.7 from gene expressions (toy example).

# Classes definition
ph.classes <- rep(0, length(ph))
ph.classes[ph > 5.7] <- 1
table(ph.classes)
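An equivalent one-liner (a stylistic aside, not from the slides): a logical comparison coerced to integer gives the same 0/1 coding in a single step:

ph.classes <- as.integer(ph > 5.7)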

Train/Test split: regression framework

# Initialization and training sample selection
set.seed()
training <- sample(1:ncol(d), round(0.8 * ncol(d)), replace = F)

# Matrices definition
d.train <- t(d[, training])
ph.train <- ph[training]
d.test <- t(d[, -training])
ph.test <- ph[-training]

# Data frames definition
r.train <- cbind(d.train, ph.train)
r.test <- cbind(d.test, ph.test)
colnames(r.train) <- c(colnames(d.train), "ph")
colnames(r.test) <- c(colnames(d.train), "ph")
r.train <- data.frame(r.train)
r.test <- data.frame(r.test)

Train/Test split: classification framework

# Vectors definition
phc.train <- ph.classes[training]
phc.test <- ph.classes[-training]

# Data frames definition
c.train <- cbind(d.train, phc.train)
c.test <- cbind(d.test, phc.test)
colnames(c.train) <- c(colnames(d.train), "phc")
colnames(c.test) <- c(colnames(d.train), "phc")
c.train <- data.frame(c.train)
c.test <- data.frame(c.test)

# Transforming phc into a factor
c.train$phc <- factor(c.train$phc)
c.test$phc <- factor(c.test$phc)

Training / Test sets (short) analysis

layout(matrix(c(1, 1, 2, 3, 3, 4), ncol = 3, nrow = 2, byrow = T))
boxplot(d.train, main = "genes expressions (train)")
boxplot(ph.train, main = "ph (train)")
boxplot(d.test, main = "genes expressions (test)")
boxplot(ph.test, main = "ph (test)")
table(phc.train)
table(phc.test)

Neural networks

Loading the data (to be consistent) and the nnet library

# Loading the data
load("../../Data/train-test.Rdata")

MLPs are unusable with a large number of predictors (here: 272 predictors for only 45 observations in the training set), hence the selection of a relevant subset of variables (with the LASSO; see the sketch below):

# Loading the subset of predictors
load("../../Data/selected.Rdata")

MLPs are provided in the nnet package:

# Loading the nnet library
library(nnet)
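The slides load a pre-computed subset; as a rough illustration (an assumption, since the code that actually produced selected.Rdata is not shown), such a subset could be obtained with the LASSO via the glmnet package, keeping the predictors with non-zero coefficients:

# Hedged sketch of a LASSO-based selection (glmnet assumed installed;
# the seed is arbitrary, not the one used to build selected.Rdata)
library(glmnet)
set.seed(42)
cv <- cv.glmnet(d.train, ph.train, alpha = 1)
coefs <- as.matrix(coef(cv, s = "lambda.min"))
selected.sketch <- which(coefs[-1, 1] != 0)  # drop the intercept row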

Simple use: MLP in the regression framework

Training:
# Simple use: MLP with p = 3 and no decay
set.seed()
nn1 <- nnet(d.train[, selected], ph.train, size = 3, decay = 0, maxit = 500, linout = T)

Analysis:
print(nn1)
summary(nn1)

# Training error and pseudo-R2
mean((ph.train - nn1$fitted)^2)
1 - mean((ph.train - nn1$fitted)^2) / var(ph.train)

# Predictions (test set)
pred.test <- predict(nn1, d.test[, selected])

# Test error and pseudo-R2
mean((ph.test - pred.test)^2)
1 - mean((ph.test - pred.test)^2) / var(ph.train)
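Since the same error computations recur throughout the session, a small helper (an addition, not from the slides) avoids retyping them; pseudo-R2 is defined here, as in the script, as 1 minus the MSE divided by the variance of the training pH:

# Helper: MSE and pseudo-R2 for a vector of predictions
perf <- function(obs, pred) {
  mse <- mean((obs - pred)^2)
  c(MSE = mse, pseudoR2 = 1 - mse / var(ph.train))
}
perf(ph.train, nn1$fitted)
perf(ph.test, pred.test)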

Predictions vs observations

plot(ph.train, nn1$fitted, xlab = "Observations", ylab = "Fitted values", main = "", pch = 3, col = "blue")
points(ph.test, pred.test, pch = 19, col = "darkred")
legend("bottomright", pch = c(3, 19), col = c("blue", "darkred"), legend = c("Train", "Test"))
abline(0, 1, col = "darkgreen")

Check results variability

set.seed()
nn2 <- nnet(d.train[, selected], ph.train, size = 3, decay = 0, maxit = 500, linout = T)
...

Summary:
      Train    Test
nn1   30.8%    44.9%
nn2   99.8%    -34!!
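To make the variability visible beyond two runs, here is a sketch (not in the original script; the seeds are arbitrary) that retrains the same MLP with several seeds and collects the test pseudo-R2 values:

# Sketch: test pseudo-R2 of the same MLP over several random seeds
r2 <- sapply(1:5, function(s) {
  set.seed(s)
  nn <- nnet(d.train[, selected], ph.train, size = 3, decay = 0,
             maxit = 500, linout = TRUE, trace = FALSE)
  p <- predict(nn, d.test[, selected])
  1 - mean((ph.test - p)^2) / var(ph.train)
})
summary(r2)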

Simple use: MLP in the classification framework

Training:
# Simple use: MLP with p = 3 and no decay
set.seed()
nnc1 <- nnet(phc ~ ., data = c.train, size = 3, decay = 0, maxit = 500, linout = F)

Analysis:
print(nnc1)
summary(nnc1)

# Predictions
nnc1$fitted

# Recoding
pred.train <- rep(0, length(nnc1$fitted))
pred.train[nnc1$fitted > 0.5] <- 1

# Training error
table(pred.train, c.train$phc)
sum(pred.train != c.train$phc) / length(pred.train)

Test error

# Predictions and recoding
raw.pred.test <- predict(nnc1, c.test)
pred.test <- rep(0, length(raw.pred.test))
pred.test[raw.pred.test > 0.5] <- 1

# Test error
table(pred.test, c.test$phc)
sum(pred.test != c.test$phc) / length(pred.test)

Summary: confusion tables (predictions vs observations) for the train and test sets; overall misclassification rates: 2.22% (train) and 45.45% (test).

Tuning MLP with the e1071 package

Search for the best parameters with 10-fold CV:
library(e1071)
set.seed(1643)
t.nnc2 <- tune.nnet(phc ~ ., data = c.train[, c(selected, ncol(c.train))],
                    size = c(2, 5, 10), decay = 10^(-c(10, 8, 6, 4, 2)),
                    maxit = 500, linout = F,
                    tunecontrol = tune.control(nrepeat = 5, sampling = "cross", cross = 10))

Basic analysis of the output:
# Looking for the best parameters
plot(t.nnc2)

Best parameters?

summary(t.nnc2)
t.nnc2$best.parameters

# Selecting the best MLP
nnc2 <- t.nnc2$best.model

Results analysis

# Training error
pred.train <- rep(0, length(nnc2$fitted))
pred.train[nnc2$fitted > 0.5] <- 1
table(pred.train, c.train$phc)
sum(pred.train != c.train$phc) / length(pred.train)

# Predictions and test error
raw.pred.test <- predict(nnc2, c.test)
pred.test <- rep(0, length(raw.pred.test))
pred.test[raw.pred.test > 0.5] <- 1
table(pred.test, c.test$phc)
sum(pred.test != c.test$phc) / length(pred.test)

Summary:
       Train    Test
nnc1   2.22%    45.45%
nnc2   0%       22.27%

CART

Regression tree training

Training with the rpart package:
library(rpart)
# Regression tree
tree1 <- rpart(ph ~ ., data = r.train, control = rpart.control(minsplit = 2))

Basic analysis of the output:
print(tree1)
summary(tree1)
plot(tree1)
text(tree1)
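An optional complement (not in the original script): rpart also supports cost-complexity pruning, the usual CART way of limiting overfitting; the cp value minimizing the cross-validated error can be read from the cp table:

# Sketch: cost-complexity pruning of tree1
printcp(tree1)  # cross-validated error for each complexity value
best.cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
tree1.pruned <- prune(tree1, cp = best.cp)
plot(tree1.pruned); text(tree1.pruned)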

Resulting tree

tree1$where  # leaf number
tree1$frame  # nodes features

Performance analysis

# Training predictions and error
pred.train <- tree1$frame$yval[tree1$where]
mean((pred.train - r.train$ph)^2)
1 - mean((pred.train - r.train$ph)^2) / var(ph.train)

# Test predictions and error
pred.test <- predict(tree1, r.test)
mean((pred.test - r.test$ph)^2)
1 - mean((pred.test - r.test$ph)^2) / var(ph.train)

# Fitted values vs True values
plot(ph.train, pred.train, xlab = "Observations", ylab = "Fitted values", main = "", pch = 3, col = "blue")
points(ph.test, pred.test, pch = 19, col = "darkred")
legend("bottomright", pch = c(3, 19), col = c("blue", "darkred"), legend = c("Train", "Test"))
abline(0, 1, col = "darkgreen")

Summary

Numerical performance:
        Train    Test
tree1   96.6%    11.6%
nn2     99.8%    -34!!

Classification tree training and tuning

Training with the rpart package and tuning with the e1071 package:
# Random seed initialization
set.seed()
# Tuning the minimal number of observations in a node
# (minsplit; default value is 20)
t.treec1 <- tune.rpart(phc ~ ., data = c.train, minsplit = c(4, 6, 8, 10),
                       tunecontrol = tune.control(sampling = "bootstrap", nboot = 20))

Basic analysis of the output:
plot(t.treec1)

Tuning results

t.treec1$best.parameters
treec1 <- t.treec1$best.model

Basic analysis of the best tree

summary(treec1)
plot(treec1)
text(treec1)
treec1$where  # leaf number
treec1$frame  # nodes features

Predictions and errors

Training set:
# Make the prediction
pred.train <- predict(treec1, c.train)
# Find out which class is predicted
pred.train <- apply(pred.train, 1, which.max)
library(car)  # to have the "recode" function
pred.train <- recode(pred.train, "2=1;1=0")
# Calculate misclassification error
table(pred.train, phc.train)
sum(pred.train != phc.train) / length(pred.train)

Test set:
pred.test <- predict(treec1, c.test)
pred.test <- apply(pred.test, 1, which.max)
pred.test <- recode(pred.test, "2=1;1=0")
table(pred.test, phc.test)
sum(pred.test != phc.test) / length(pred.test)
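A simpler alternative (an aside, not from the slides): predict() on an rpart classification tree can return the predicted class directly with type = "class", which makes the which.max / recode steps unnecessary:

# Sketch: direct class predictions for the test set
pred.test2 <- predict(treec1, c.test, type = "class")
table(pred.test2, phc.test)
sum(pred.test2 != phc.test) / length(pred.test2)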

Classification performance summary

Overall misclassification rates:
         Train    Test
nnc1     2.22%    45.45%
nnc2     0%       22.27%
treec1   ...      ...

Random forest

Random forest training (regression framework)

Training with the randomForest package:
library(randomForest)
set.seed()
rf1 <- randomForest(d.train, ph.train, importance = T, keep.forest = T)

Basic analysis of the output:
plot(rf1)
rf1$ntree
rf1$mtry
rf1$importance
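Instead of keeping the default mtry, the randomForest package also ships a small search utility; a sketch of its use (not in the slides; the seed is arbitrary):

# Sketch: search for an mtry value with low out-of-bag error
set.seed(1)
tuneRF(d.train, ph.train, stepFactor = 1.5, improve = 0.01)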

Predictions and errors

# Training set (oob error and training error)
mean((ph.train - rf1$predicted)^2)
rf1$mse
1 - mean((ph.train - rf1$predicted)^2) / var(ph.train)
1 - mean((ph.train - predict(rf1, d.train))^2) / var(ph.train)

# Test set
pred.test <- predict(rf1, d.test)
mean((ph.test - pred.test)^2)
1 - mean((ph.test - pred.test)^2) / var(ph.train)

Summary:
      Train    Test
nn1   30.8%    44.9%
nn2   99.8%    -34
rf1   84.5%    51.2%

Predictions vs observations

plot(ph.train, rf1$predicted, xlab = "Observations", ylab = "Fitted values", main = "", pch = 3, col = "blue")
points(ph.test, pred.test, pch = 19, col = "darkred")
legend("bottomright", pch = c(3, 19), col = c("blue", "darkred"), legend = c("Train", "Test"))
abline(0, 1, col = "darkgreen")

Importance analysis

layout(matrix(c(1, 2), ncol = 2))
barplot(t(rf1$importance[, 1]), xlab = "variables", ylab = "% Inc. MSE", col = "darkred", las = 2, names = rep(NA, nrow(rf1$importance)))
barplot(t(rf1$importance[, 2]), xlab = "variables", ylab = "Inc. Node Purity", col = "darkred", las = 2, names = rep(NA, nrow(rf1$importance)))
which(rf1$importance[, 1] > 0.0005)
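The same information can also be displayed with the package's built-in helper, a one-line alternative to the barplots above:

varImpPlot(rf1)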

Random forest training (classification framework)

Training with the randomForest package (advanced features):
set.seed()
rfc1 <- randomForest(d.train, factor(phc.train), ntree = 5000, mtry = 150,
                     sampsize = 40, nodesize = 2,
                     xtest = d.test, ytest = factor(phc.test),
                     importance = T, keep.forest = F)

Basic analysis of the output:
plot(rfc1)
rfc1$importance

Predictions and errors

# Training set (oob error and training error)
table(rfc1$predicted, phc.train)
sum(rfc1$predicted != phc.train) / length(phc.train)

# Test set
table(rfc1$test$predicted, phc.test)
sum(rfc1$test$predicted != phc.test) / length(phc.test)

Overall misclassification rates:
         Train    Test
nnc1     2.22%    45.45%
nnc2     0%       22.27%
treec1   ...      ...
rfc1     ...      36.36%

Importance analysis

# Importance analysis
layout(matrix(c(1, 2), ncol = 2))
barplot(t(rfc1$importance[, 3]), xlab = "variables", ylab = "Mean Dec. Accu.", col = "darkred", las = 2, names = rep(NA, nrow(rfc1$importance)))
barplot(t(rfc1$importance[, 4]), xlab = "variables", ylab = "Mean Dec. Gini", col = "darkred", las = 2, names = rep(NA, nrow(rfc1$importance)))
which(rfc1$importance[, 3] > 0.002)

References

Liaubet, L., Lobjois, V., Faraut, T., Tircazes, A., Benne, F., Iannuccelli, N., Pires, J., Glénisson, J., Robic, A., Le Roy, P., SanCristobal, M., and Cherel, P. (2011). Genetic variability of transcript abundance in pig peri-mortem skeletal muscle: eQTL localized genes involved in stress response, cell death, muscle disorders and metabolism. BMC Genomics, 12:548.
