Machine Learning. Nathalie Villa-Vialaneix - Formation INRA, Niveau 3

Size: px

Start display at page:

Download "Machine Learning. Nathalie Villa-Vialaneix - Formation INRA, Niveau 3"

Joshua Robbins
6 years ago
Views:

1 Machine Learning Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr IUT STID (Carcassonne) & SAMM (Université Paris 1) Formation INRA, Niveau 3 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 1 / 37

2 Outline 1 Introduction Background and notations Undertting / Overtting Errors Use case 2 Neural networks Overview Multilayer perceptron Learning/Tuning 3 CART Introduction Learning Prediction 4 Random forest Overview Bootstrap/Bagging Random forest Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 2 / 37

3 Introduction Outline 1 Introduction Background and notations Undertting / Overtting Errors Use case 2 Neural networks Overview Multilayer perceptron Learning/Tuning 3 CART Introduction Learning Prediction 4 Random forest Overview Bootstrap/Bagging Random forest Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 3 / 37

4 Introduction Background and notations Background Purpose: predict Y from X ; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 4 / 37

5 Introduction Background and notations Background Purpose: predict Y from X ; What we have: n observations of (X, Y ), (x 1, y 1 ),..., (x n, y n ); Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 4 / 37

6 Introduction Background and notations Background Purpose: predict Y from X ; What we have: n observations of (X, Y ), (x 1, y 1 ),..., (x n, y n ); What we want: estimate unknown Y from new X : x n+1,..., x m. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 4 / 37

7 Introduction Background and notations Background Purpose: predict Y from X ; What we have: n observations of (X, Y ), (x 1, y 1 ),..., (x n, y n ); What we want: estimate unknown Y from new X : x n+1,..., x m. X can be: numeric variables; or factors; or a combination of numeric variables and factors. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 4 / 37

8 Introduction Background and notations Background Purpose: predict Y from X ; What we have: n observations of (X, Y ), (x 1, y 1 ),..., (x n, y n ); What we want: estimate unknown Y from new X : x n+1,..., x m. X can be: numeric variables; or factors; or a combination of numeric variables and factors. Y can be: a numeric variable (Y R) (supervised) regression régression; a factor (supervised) classication discrimination. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 4 / 37

9 Introduction Undertting / Overtting Basics From (x i, y i ) i, denition of a machine, Φ n s.t.: ŷ new = Φ n (x new ). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 5 / 37

10 Introduction Undertting / Overtting Basics From (x i, y i ) i, denition of a machine, Φ n s.t.: ŷ new = Φ n (x new ). if Y is numeric, Φ n is called a regression function fonction de classication; if Y is a factor, Φ n is called a classier classieur; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 5 / 37

11 Introduction Undertting / Overtting Basics From (x i, y i ) i, denition of a machine, Φ n s.t.: ŷ new = Φ n (x new ). if Y is numeric, Φ n is called a regression function fonction de classication; if Y is a factor, Φ n is called a classier classieur; Φ n is said to be trained or learned from the observations (x i, y i ) i. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 5 / 37

12 Introduction Undertting / Overtting Basics From (x i, y i ) i, denition of a machine, Φ n s.t.: ŷ new = Φ n (x new ). if Y is numeric, Φ n is called a regression function fonction de classication; if Y is a factor, Φ n is called a classier classieur; Φ n is said to be trained or learned from the observations (x i, y i ) i. Desirable properties accuracy to the observations: predictions made on known data are close to observed values; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 5 / 37

13 Introduction Undertting / Overtting Basics From (x i, y i ) i, denition of a machine, Φ n s.t.: ŷ new = Φ n (x new ). if Y is numeric, Φ n is called a regression function fonction de classication; if Y is a factor, Φ n is called a classier classieur; Φ n is said to be trained or learned from the observations (x i, y i ) i. Desirable properties accuracy to the observations: predictions made on known data are close to observed values; generalization ability: predictions made on new data are also accurate. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 5 / 37

14 Introduction Undertting / Overtting Basics From (x i, y i ) i, denition of a machine, Φ n s.t.: ŷ new = Φ n (x new ). if Y is numeric, Φ n is called a regression function fonction de classication; if Y is a factor, Φ n is called a classier classieur; Φ n is said to be trained or learned from the observations (x i, y i ) i. Desirable properties accuracy to the observations: predictions made on known data are close to observed values; generalization ability: predictions made on new data are also accurate. Conicting objectives!! Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 5 / 37

15 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage Function x y to be estimated Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

16 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage Observations we might have Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

17 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage Observations we do have Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

18 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage First estimation from the observations: undertting Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

19 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage Second estimation from the observations: accurate estimation Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

20 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage Third estimation from the observations: overtting Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

21 Introduction Undertting / Overtting Undertting/Overtting sous/sur - apprentissage Summary Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 6 / 37

22 Introduction Errors Errors training error (measures the accuracy to the observations) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 7 / 37

23 Introduction Errors Errors training error (measures the accuracy to the observations) if y is a factor: misclassication rate {ŷ i y i, i = 1,..., n} n Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 7 / 37

24 Introduction Errors Errors training error (measures the accuracy to the observations) if y is a factor: misclassication rate {ŷ i y i, i = 1,..., n} if y is numeric: mean square error (MSE) n 1 n n (ŷ i y i ) 2 i=1 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 7 / 37

25 Introduction Errors Errors training error (measures the accuracy to the observations) if y is a factor: misclassication rate {ŷ i y i, i = 1,..., n} if y is numeric: mean square error (MSE) n 1 n n (ŷ i y i ) 2 i=1 or root mean square error (RMSE) or pseudo-r 2 : 1 MSE/Var((y i ) i ) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 7 / 37

26 Introduction Errors Errors training error (measures the accuracy to the observations) if y is a factor: misclassication rate {ŷ i y i, i = 1,..., n} if y is numeric: mean square error (MSE) n 1 n n (ŷ i y i ) 2 i=1 or root mean square error (RMSE) or pseudo-r 2 : 1 MSE/Var((y i ) i ) test error: a way to prevent overtting (estimates the generalization error) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 7 / 37

27 Introduction Errors Errors training error (measures the accuracy to the observations) if y is a factor: misclassication rate {ŷ i y i, i = 1,..., n} if y is numeric: mean square error (MSE) n 1 n n (ŷ i y i ) 2 i=1 or root mean square error (RMSE) or pseudo-r 2 : 1 MSE/Var((y i ) i ) test error: a way to prevent overtting (estimates the generalization error) 1 split the data into training/test sets (usually 80%/20%) 2 train Φ n from the training dataset 3 calculate the test error from the remaining data simple validation Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 7 / 37

28 Introduction Errors Example Observations Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 8 / 37

29 Introduction Errors Example Training/Test datasets Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 8 / 37

30 Introduction Errors Example Training/Test errors Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 8 / 37

31 Introduction Errors Example Summary Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 8 / 37

32 Introduction Errors Linear vs Nonparametric Linear methods: Y = β T X + ɛ (a priori on the type of link between X and Y ) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 9 / 37

33 Introduction Errors Linear vs Nonparametric Linear methods: Y = β T X + ɛ (a priori on the type of link between X and Y ) Here: nonparametric methods: where Φ is totally unknown. Y = Φ(X ) + ɛ Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 9 / 37

34 Introduction Errors Linear vs Nonparametric Linear methods: Y = β T X + ɛ (a priori on the type of link between X and Y ) Here: nonparametric methods: Y = Φ(X ) + ɛ where Φ is totally unknown. (ML) Objective: Build Φ n from the observations such that its generalization error ELΦ n is (asymptotically) optimal. Example (regression framework) whatever (X, Y ) distribution. ELΦ n := E [ (Φ n (X ) Y ) 2] n + inf Φ ELΦ Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 9 / 37

35 Introduction Use case Use case description Data kindly provided by Laurence Liaubet described in [Liaubet et al., 2011]: microarray data: expression of 272 selected genes over 56 individuals (pigs); a phenotype of interest (muscle ph) measured over the 56 invididuals (numerical variable). le 1: genes expressions le 2: muscle ph Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 10 / 37

36 Neural networks Outline 1 Introduction Background and notations Undertting / Overtting Errors Use case 2 Neural networks Overview Multilayer perceptron Learning/Tuning 3 CART Introduction Learning Prediction 4 Random forest Overview Bootstrap/Bagging Random forest Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 11 / 37

37 Neural networks Overview Basics [Bishop, 1995] Common properties (articial) Neural networks: general name for supervised and unsupervised methods developed in analogy to the brain; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 12 / 37

38 Neural networks Overview Basics [Bishop, 1995] Common properties (articial) Neural networks: general name for supervised and unsupervised methods developed in analogy to the brain; combination (network) of simple elements (neurons). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 12 / 37

39 Neural networks Overview Basics [Bishop, 1995] Common properties (articial) Neural networks: general name for supervised and unsupervised methods developed in analogy to the brain; combination (network) of simple elements (neurons). Example of graphical representation: INPUTS OUTPUTS Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 12 / 37

40 Neural networks Overview Main features A neural network is dened by: 1 the network structure; 2 the neuron type. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 13 / 37

41 Neural networks Overview Main features A neural network is dened by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classication and regression); Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 13 / 37

42 Neural networks Overview Main features A neural network is dened by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classication and regression); Radial basis function networks (RBF): same purpose but based on local smoothing; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 13 / 37

43 Neural networks Overview Main features A neural network is dened by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classication and regression); Radial basis function networks (RBF): same purpose but based on local smoothing; Self-organizing maps (SOM): dedicated to unsupervised problems (clustering), self-organized;... Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 13 / 37

44 Neural networks Overview MLP: Advantages/Drawbacks Advantages classication OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: no prior assumption needed; accurate (universal approximation). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 14 / 37

45 Neural networks Overview MLP: Advantages/Drawbacks Advantages classication OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: no prior assumption needed; accurate (universal approximation). Drawbacks hard to train (high computational cost, especially when d is large); overt easily; black box models (hard to interpret) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 14 / 37

46 Neural networks Multilayer perceptron Analogy to the brain Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 15 / 37

47 Neural networks Multilayer perceptron Analogy to the brain Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 15 / 37

48 Neural networks Multilayer perceptron Analogy to the brain Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 15 / 37

49 Neural networks Multilayer perceptron Analogy to the brain If > activation threshold then Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 15 / 37

50 Neural networks Multilayer perceptron Analogy to the brain If < activation threshold then Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 15 / 37

51 Neural networks Multilayer perceptron (articial) Perceptron Layers MLP have one input layer (X values), one output layer (Y values) and several hidden layers (only 1 is necessary); no connections within a layer; connections between two consecutive layers (feedforward). Example: INPUTS genes expressions Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 16 / 37

52 Neural networks Multilayer perceptron (articial) Perceptron Layers MLP have one input layer (X values), one output layer (Y values) and several hidden layers (only 1 is necessary); no connections within a layer; connections between two consecutive layers (feedforward). Example: INPUTS Weighted links genes expressions Layer 1 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 16 / 37

53 Neural networks Multilayer perceptron (articial) Perceptron Layers MLP have one input layer (X values), one output layer (Y values) and several hidden layers (only 1 is necessary); no connections within a layer; connections between two consecutive layers (feedforward). Example: INPUTS genes expressions Layer 1 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 16 / 37

54 Neural networks Multilayer perceptron (articial) Perceptron Layers MLP have one input layer (X values), one output layer (Y values) and several hidden layers (only 1 is necessary); no connections within a layer; connections between two consecutive layers (feedforward). Example: INPUTS genes expressions Layer 1 Layer 2 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 16 / 37

55 Neural networks Multilayer perceptron (articial) Perceptron Layers MLP have one input layer (X values), one output layer (Y values) and several hidden layers (only 1 is necessary); no connections within a layer; connections between two consecutive layers (feedforward). Example: 2 hidden layers MLP INPUTS OUTPUTS genes expressions Layer 1 Layer 2 ph Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 16 / 37

56 Neural networks Multilayer perceptron A neuron v 3 w 3 v 2 v 1 w 2 w 1 + w 0 (Bias Biais) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 17 / 37

57 Neural networks Multilayer perceptron A neuron v 3 w 3 w 2 v 2 w 2 + v 1 w 1 w 0 (Bias Biais) w 1 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 17 / 37

58 Neural networks Multilayer perceptron A neuron v 3 w 3 w 2 v 2 w 2 + v 1 w 1 w 0 (Bias Biais) w 1 Standard activation functions fonctions de lien / d'activation Biologically inspired: Heaviside function Seuil { 0 if t < threshold; S(t) = 1 if not. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 17 / 37

59 Neural networks Multilayer perceptron A neuron v 3 w 3 w 2 v 2 w 2 + v 1 w 1 w 0 (Bias Biais) w 1 Standard activation functions Main issue with the Heaviside function: not continuous! Logistic function S(t) = 1 1+e t Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 17 / 37

60 Neural networks Multilayer perceptron Summary If Y is numeric, linear output: p x R d, F w (x) = j=1 w (2) S j ( d k=1 w (1) x k + w (0) kj j ). w (1) neuron 1 INPUTS x 1 x 2... x d w (1) dp No analytical expression!! +w (0) p neuron p w (2) 1 w (2) p F w (x) OUTPUTS Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 18 / 37

61 Neural networks Multilayer perceptron Summary If Y is a factor, logistic output: p x R d, P(X = C x = x) F w (x) = S j=1 w (2) S j ( d k=1 (with a maximum probability rule for the nal classication) ) w (1) x k + w (0) kj j w (1) neuron 1 INPUTS x 1 x 2... x d w (1) dp +w (0) p neuron p w (2) 1 w (2) p F w (x) OUTPUTS No analytical expression!! Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 18 / 37

62 Neural networks Multilayer perceptron Universal approximation [Hornik et al., 1989] (among others) For any given Φ, smooth enough and any precision ɛ, there exists a one-hidden layer perceptron (with sigmoïd activation functions) that approximates Φ with a precision at most ɛ. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 19 / 37

63 Neural networks Learning/Tuning Learning weights p is given Chose w s.t.: (MSE minimization) w n = arg min w 1 n i=1 n (y i F w (x i )) 2. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 20 / 37

64 Neural networks Learning/Tuning Learning weights p is given Chose w s.t.: (MSE minimization) Main issues: w n = arg min w 1 n i=1 n (y i F w (x i )) 2. 1 no exact solution ( approximation algorithms, e.g., Newton's method + backpropagation principle): local minima; Erreur Poids Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 20 / 37

65 Neural networks Learning/Tuning Learning weights p is given Chose w s.t.: (MSE minimization) Main issues: w n = arg min w 1 n i=1 n (y i F w (x i )) 2. 1 no exact solution ( approximation algorithms, e.g., Newton's method + backpropagation principle): local minima; 2 overtting: the larger p is, the more exible the perceptron is and the more it can overt the data. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 20 / 37

66 Neural networks Learning/Tuning Overtting Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 21 / 37

67 Neural networks Learning/Tuning Overtting Weight decay can help improve the generalization ability: w n 1 n = arg min (y i F w (x i )) 2 +λ w 2 w n i=1 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 21 / 37

68 Neural networks Learning/Tuning Tuning p and λ p and λ are called hyperparameters hyper-paramètres (not learned from the data but tuned). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 22 / 37

69 Neural networks Learning/Tuning Tuning p and λ p and λ are called hyperparameters hyper-paramètres (not learned from the data but tuned). grid search using a simple validation; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 22 / 37

70 Neural networks Learning/Tuning Tuning p and λ p and λ are called hyperparameters hyper-paramètres (not learned from the data but tuned). grid search using a simple validation; grid search using a (K -fold) cross validation (better but computationally expensive when K is large). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 22 / 37

71 Neural networks Learning/Tuning Cross validation Algorithm 1: Set the grid search for p, G p, and λ, G λ 2: Split the data into K groups 3: for p G p and λ G λ do 4: for group = 1..K do 5: Train model p,λ,group without observations in group 6: Test error, MSE p,λ,group for observations in group 7: end for 8: Average MSE p,λ,group over group MSE p,λ 9: end for 10: Select p and λ with minimum MSE p,λ Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 23 / 37

72 Neural networks Learning/Tuning Cross validation Algorithm 1: Set the grid search for p, G p, and λ, G λ 2: Split the data into K groups 3: for p G p and λ G λ do 4: for group = 1..K do 5: Train model p,λ,group without observations in group 6: Test error, MSE p,λ,group for observations in group 7: end for 8: Average MSE p,λ,group over group MSE p,λ 9: end for 10: Select p and λ with minimum MSE p,λ n-fold cross validation is called Leave-One-Out (LOO). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 23 / 37

73 Neural networks Learning/Tuning Cross validation Algorithm 1: Set the grid search for p, G p, and λ, G λ 2: Split the data into K groups 3: for p G p and λ G λ do 4: for group = 1..K do 5: Train model p,λ,group without observations in group 6: Test error, MSE p,λ,group for observations in group 7: end for 8: Average MSE p,λ,group over group MSE p,λ 9: end for 10: Select p and λ with minimum MSE p,λ n-fold cross validation is called Leave-One-Out (LOO). Standard choice for K : LOO or 10-fold CV. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 23 / 37

74 CART Outline 1 Introduction Background and notations Undertting / Overtting Errors Use case 2 Neural networks Overview Multilayer perceptron Learning/Tuning 3 CART Introduction Learning Prediction 4 Random forest Overview Bootstrap/Bagging Random forest Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 24 / 37

75 CART Introduction Overview CART: Classication And Regression Trees introduced by [Breiman et al., 1984]. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 25 / 37

76 CART Introduction Overview CART: Classication And Regression Trees introduced by [Breiman et al., 1984]. Advantages classication OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: no prior assumption needed; can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method); provide an intuitive interpretation. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 25 / 37

77 CART Introduction Overview CART: Classication And Regression Trees introduced by [Breiman et al., 1984]. Advantages classication OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: no prior assumption needed; can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method); provide an intuitive interpretation. Drawbacks require a large training dataset to be ecient; as a consequence, are often too simple to provide accurate predictions. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 25 / 37

78 CART Introduction Example X = (Gender, Age, Height) and Y = Weight Root Height<1.60m Height>1.60m a split a node Gender=M Gender=F Age<30 Age>30 Y 1 Y 2 Y 3 a leaf (terminal node) Gender=M Gender=F Y 4 Y 5 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 26 / 37

79 CART Learning CART learning process Algorithm 1: Start from root 2: repeat 3: move to a new node 4: if the node is homogeneous or small enough then 5: STOP 6: else 7: split the node into two child nodes with maximal homogeneity 8: end if 9: until all nodes are processed Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 27 / 37

80 CART Learning Further details Homogeneity? if Y is a numeric variable, variance of (y i ) i for the observations assigned to the node (Gini index is also sometimes used); Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 28 / 37

81 CART Learning Further details Homogeneity? if Y is a numeric variable, variance of (y i ) i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 28 / 37

82 CART Learning Further details Homogeneity? if Y is a numeric variable, variance of (y i ) i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Stopping criteria? Minimum size node (generally 1 or 5) Minimum node purity or variance Maximum tree depth Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 28 / 37

83 CART Learning Further details Homogeneity? if Y is a numeric variable, variance of (y i ) i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Stopping criteria? Minimum size node (generally 1 or 5) Minimum node purity or variance Maximum tree depth Hyperparameters can be tuned by cross-validation using a grid search. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 28 / 37

84 CART Learning Further details Homogeneity? if Y is a numeric variable, variance of (y i ) i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Stopping criteria? Minimum size node (generally 1 or 5) Minimum node purity or variance Maximum tree depth Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning... Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 28 / 37

85 CART Learning Choosing an optimal subtree Algorithm 1: Train the maximal tree, T 2: Pruning: Find an optimal subtrees sequence (T k ) k=1,...,k 3: By cross validation, nd the errors L(T k ) + λc(t k ) for k = 1,..., K where L is the error and C is a complexity measure (number of leafs) 4: Select the subtree s.t. L(T k ) + λc(t k ) is minimum Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 29 / 37

86 CART Prediction Making new predictions A new observation, x new is assigned to a leaf (straightforward); Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 30 / 37

87 CART Prediction Making new predictions A new observation, x new is assigned to a leaf (straightforward); the corresponding predicted ŷ new is if Y is numeric, the mean value of the observations (training set) assigned to the same leaf; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 30 / 37

88 CART Prediction Making new predictions A new observation, x new is assigned to a leaf (straightforward); the corresponding predicted ŷ new is if Y is numeric, the mean value of the observations (training set) assigned to the same leaf; if Y is a factor, the majority class of the observations (training set) assigned to the same leaf. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 30 / 37

89 Random forest Outline 1 Introduction Background and notations Undertting / Overtting Errors Use case 2 Neural networks Overview Multilayer perceptron Learning/Tuning 3 CART Introduction Learning Prediction 4 Random forest Overview Bootstrap/Bagging Random forest Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 31 / 37

90 Random forest Overview Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001]. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 32 / 37

91 Random forest Overview Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001]. Advantages classication OR regression (i.e., Y can be a numeric variable or a factor); non parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small samples. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 32 / 37

92 Random forest Overview Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001]. Advantages classication OR regression (i.e., Y can be a numeric variable or a factor); non parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small samples. Drawbacks black box model; is not supported by strong mathematical results (consistency...) until now. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 32 / 37

93 Random forest Bootstrap/Bagging Basic description A fact: When the sample size is small, you might be unable to estimate properly. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 33 / 37

94 Random forest Bootstrap/Bagging Basic description A fact: When the sample size is small, you might be unable to estimate properly. This issue is commonly tackled by bootstrapping and, more specically, bagging (Boostrap Aggregating). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 33 / 37

95 Random forest Bootstrap/Bagging Basic description A fact: When the sample size is small, you might be unable to estimate properly. This issue is commonly tackled by bootstrapping and, more specically, bagging (Boostrap Aggregating). Bagging: combination of simple (and underecient) regression (or classication) functions; Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 33 / 37

96 Random forest Bootstrap/Bagging Basic description A fact: When the sample size is small, you might be unable to estimate properly. This issue is commonly tackled by bootstrapping and, more specically, bagging (Boostrap Aggregating). Bagging: combination of simple (and underecient) regression (or classication) functions; Random forest CART bagging. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 33 / 37

97 Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 34 / 37

98 Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Standard bootstrap sample size: 2/3n. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 34 / 37

99 Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Standard bootstrap sample size: 2/3n. General (and robust) approach to solve several problems: Estimating condence intervals (of X with no prior assumption on the distribution of X ) 1 Build P bootstrap samples from (x i ) i 2 Use them to estimate X P times 3 The condence interval is based on the percentiles of the empirical distribution of X Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 34 / 37

Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Standard bootstrap sample size: 2/3n.

100 Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Standard bootstrap sample size: 2/3n. General (and robust) approach to solve several problems: Estimating condence intervals (of X with no prior assumption on the distribution of X ) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 34 / 37

101 Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Standard bootstrap sample size: 2/3n. General (and robust) approach to solve several problems: Estimating condence intervals (of X with no prior assumption on the distribution of X ) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 34 / 37

102 Random forest Bootstrap/Bagging Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset. Standard bootstrap sample size: 2/3n. General (and robust) approach to solve several problems: Estimating condence intervals (of X with no prior assumption on the distribution of X ) Also useful to estimate p-values, residuals,... Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 34 / 37

103 Random forest Bootstrap/Bagging Bagging Average the estimates of the regression (or the classication) function obtained from B bootstrap samples. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 35 / 37

104 Random forest Bootstrap/Bagging Bagging Average the estimates of the regression (or the classication) function obtained from B bootstrap samples. Bagging with regression trees 1: for b = 1,..., B do 2: Construct a bootstrap sample ξ b 3: Train a regression tree from ξ b, ˆφ b 4: end for 5: Estimate the regression function by ˆΦ n (x) = 1 B B ˆφ b (x). b=1 Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 35 / 37

105 Random forest Bootstrap/Bagging Bagging Average the estimates of the regression (or the classication) function obtained from B bootstrap samples. Bagging with regression trees 1: for b = 1,..., B do 2: Construct a bootstrap sample ξ b 3: Train a regression tree from ξ b, ˆφ b 4: end for 5: Estimate the regression function by ˆΦ n (x) = 1 B B ˆφ b (x). b=1 For classication, the predicted class is the majority vote class. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 35 / 37

106 Random forest Random forest Random forests CART bagging with additional disturbances 1 each node is based on a random (and dierent) subset of q variables (an advisable choice for q is p for classication and p/3 for regression). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 36 / 37

107 Random forest Random forest Random forests CART bagging with additional disturbances 1 each node is based on a random (and dierent) subset of q variables (an advisable choice for q is p for classication and p/3 for regression). 2 the tree is restricted to the rst l nodes with l small. Hyperparameters those of the CART algorithm; those that are specic to the random forest: q, bootstrap sample size, number of trees. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 36 / 37

108 Random forest Random forest Random forests CART bagging with additional disturbances 1 each node is based on a random (and dierent) subset of q variables (an advisable choice for q is p for classication and p/3 for regression). 2 the tree is restricted to the rst l nodes with l small. Hyperparameters those of the CART algorithm; those that are specic to the random forest: q, bootstrap sample size, number of trees. Random forest are not very sensitive to hyper-parameters setting: default values for q and bootstrap sample size (2n/3) should work in most cases. Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 36 / 37

109 Random forest Random forest Additional tools OOB (Out-Of Bags) error: error based on the observations not included in the bag Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 37 / 37

110 Random forest Random forest Additional tools OOB (Out-Of Bags) error: error based on the observations not included in the bag Stabilization of OOB error is a good indication that there is enough trees in the forest Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 37 / 37

111 Random forest Random forest Additional tools OOB (Out-Of Bags) error: error based on the observations not included in the bag Importance of a variable to help interpretation: for a given variable X j 1: randomize the values of the variable 2: make predictions from this new dataset 3: the importance is the mean decrease in accuracy (MSE or misclassication rate) Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 37 / 37

112 References Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA. Breiman, L. (2001). Random forests. Machine Learning, 45(1):532. Breiman, L., Friedman, J., Olsen, R., and Stone, C. (1984). Classication and Regression Trees. Chapman and Hall, New York, USA. Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feed-forward networks are universal approximators. Neural Networks, 2: Liaubet, L., Lobjois, V., Faraut, T., Tircazes, A., Benne, F., Iannuccelli, N., Pires, J., Glénisson, J., Robic, A., Le Roy, P., SanCristobal, M., and Cherel, P. (2011). Genetic variability or transcript abundance in pig peri-mortem skeletal muscle: eqtl localized genes involved in stress response, cell death, muscle disorders and metabolism. BMC Genomics, 12(548). Formation INRA (Niveau 3) Machine Learning Nathalie Villa-Vialaneix 37 / 37

Machine Learning - TP

Machine Learning - TP Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr http://www.nathalievilla.org IUT STID (Carcassonne) & SAMM (Université Paris 1) Formation INRA, Niveau 3 Formation INRA (Niveau