Optimal decision and naive Bayes

1 Optimal decision and naive Bayes
Fabrice Rossi
SAMM, Université Paris 1 Panthéon Sorbonne

2 Outline
- Standard supervised learning hypothesis
- Optimal model
- The Naive Bayes Classifier

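3 Data model
Data spaces
- $\mathcal{X}$: input space, always observed
- $\mathcal{Y}$: output space, values to predict, observed during learning
Minimal structural assumption
- $\mathcal{X}$ should be equipped with a dissimilarity $d$:
  - $d$ is a function from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}^+$
  - $d$ is symmetric
  - $\forall X, X' \in \mathcal{X}$, $d(X, X) \le d(X, X')$
Multivariate assumption
- In general $\mathcal{X}$ is structured into variables: $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \dots \times \mathcal{X}_P$
- $X = (X_1, X_2, \dots, X_P)^T$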

4 Predictive model
Definition and use
- A predictive model is a function $g$ from $\mathcal{X}$ to $\mathcal{Y}$
- for an observation $(X, Y)$ one expects $g(X)$ to be close to $Y$
Loss function
- A loss function $l$ is a function from $\mathcal{Y} \times \mathcal{Y}$ to $\mathbb{R}^+$ such that $\forall Y \in \mathcal{Y}$, $l(Y, Y) = 0$
Interpretation
- $l(g(X), Y)$ measures the loss incurred by the user of a model $g$ when the true value $Y$ is replaced by the prediction $g(X)$.

5 Examples
$\mathcal{Y} = \mathbb{R}$
- $l_2(p, t) = (p - t)^2$
- $l_1(p, t) = |p - t|$
- $l_{APE}(p, t) = \frac{|p - t|}{|t|}$
$|\mathcal{Y}| < \infty$
- $l_b(p, t) = \mathbf{1}_{p \neq t}$
- general case when $\mathcal{Y} = \{y_1, y_2\}$:

  l(p, t)   | t = y_1       | t = y_2
  p = y_1   | 0             | l(y_1, y_2)
  p = y_2   | l(y_2, y_1)   | 0

- asymmetric costs are important in practice (think SPAM versus non SPAM)
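
The loss functions listed on this slide can be written directly as small functions. The following is a minimal Python sketch; the function names and the numeric costs in `SPAM_COSTS` are illustrative assumptions, not values from the slides.

```python
def squared_loss(p, t):
    """l_2(p, t) = (p - t)^2"""
    return (p - t) ** 2


def absolute_loss(p, t):
    """l_1(p, t) = |p - t|"""
    return abs(p - t)


def ape_loss(p, t):
    """l_APE(p, t) = |p - t| / |t|, undefined when t == 0."""
    return abs(p - t) / abs(t)


def binary_loss(p, t):
    """l_b(p, t) = 1 if p != t, else 0."""
    return 0 if p == t else 1


# Hypothetical asymmetric cost table, keyed by (prediction, truth).
# Flagging a legitimate message as spam is assumed to cost more
# than letting a spam message through.
SPAM_COSTS = {
    ("spam", "spam"): 0, ("spam", "ham"): 10,
    ("ham", "spam"): 1, ("ham", "ham"): 0,
}


def table_loss(p, t, costs=SPAM_COSTS):
    """General finite case: the loss is read from a table."""
    return costs[(p, t)]
```

Every loss returns 0 when the prediction equals the truth, as the definition of a loss function requires.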

6 Stochastic framework
Main hypotheses
- observations are random variables with values in $\mathcal{X} \times \mathcal{Y}$
- they are distributed according to a fixed and unknown distribution $D$
- observations are independent
A data set
- $\mathcal{D} = ((X_i, Y_i))_{1 \le i \le N}$
- $(X_i, Y_i) \sim D$ and $\mathcal{D} \sim D^N$
- notation: $X_i = (X_{i1}, X_{i2}, \dots, X_{iP})^T$
Risk of a model
- The risk of $g$ for the loss function $l$ is $R_l(g) = E_{(X,Y) \sim D}(l(g(X), Y))$
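
To make the definition of the risk concrete, here is a small Python sketch on a toy problem. The joint distribution `JOINT` is a made-up assumption; in practice $D$ is unknown, so only the sample average over an i.i.d. data set is available.

```python
import random

# Hypothetical joint distribution D on {0, 1} x {0, 1}: P(X = x, Y = y)
JOINT = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}


def binary_loss(p, t):
    return 0 if p == t else 1


def exact_risk(g, loss, joint):
    """R_l(g) = E[l(g(X), Y)], computed by summing over the joint table."""
    return sum(prob * loss(g(x), y) for (x, y), prob in joint.items())


def sample_risk(g, loss, joint, n, seed=0):
    """Average loss over n independent draws from the joint distribution."""
    rng = random.Random(seed)
    pairs = list(joint)
    weights = [joint[pair] for pair in pairs]
    sample = rng.choices(pairs, weights=weights, k=n)
    return sum(loss(g(x), y) for x, y in sample) / n


def identity(x):
    """Toy model g(x) = x."""
    return x


print(exact_risk(identity, binary_loss, JOINT))
print(sample_risk(identity, binary_loss, JOINT, n=20000))
```

With this table the model $g(x) = x$ is wrong exactly when $X \neq Y$, which has probability 0.1, and the sample average should be close to that value.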

7 Technical aspects
Additional hypotheses
- there is an underlying probability space
- $\mathcal{X} \times \mathcal{Y}$ must be a measurable space
- in general this is done via the standard Borel sigma field on $\mathbb{R}^d$
Measurability
- loss functions must be measurable functions
- ditto for models
- technically, the loss could be $+\infty$
Independence and stationarity
- independence can be relaxed, e.g. for time series
- stationarity can be relaxed too, e.g. for drift analysis

10 Optimal model
Optimal risk
- $R_l^* = \inf_g R_l(g)$
- called the Bayes risk when $\mathcal{Y}$ is finite
Is $R_l^*$ reachable?
- in general $\arg\min_g R_l(g)$ is a set: could it be empty?
- $g_l^*$: a model such that $R_l(g_l^*) = R_l^*$
Quadratic case
- $\mathcal{Y} = \mathbb{R}$, $l_2(p, v) = (p - v)^2$
- $g^*(x) = E_{(X,Y) \sim D}(Y \mid X = x)$
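
The quadratic case can be checked numerically for a single fixed $x$. The sketch below assumes a made-up conditional distribution of $Y$ given $X = x$ and searches a grid of candidate predictions; the minimizer of the expected squared loss should coincide with the conditional mean.

```python
# Hypothetical conditional distribution of Y given X = x: value -> probability
cond = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}


def expected_sq_loss(p, cond):
    """E[(p - Y)^2 | X = x] for a candidate prediction p."""
    return sum(prob * (p - y) ** 2 for y, prob in cond.items())


# Conditional expectation E[Y | X = x]
cond_mean = sum(prob * y for y, prob in cond.items())

# Brute-force search over candidate predictions 0.00, 0.01, ..., 4.00
grid = [k / 100 for k in range(401)]
best = min(grid, key=lambda p: expected_sq_loss(p, cond))

print(cond_mean, best)
```

This is only a sanity check on one distribution, not a proof; the general argument is pointwise minimization of the conditional expected loss.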

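11 Discrete case
If $\mathcal{Y}$ is finite
- a loss function is a table with $|\mathcal{Y}|(|\mathcal{Y}| - 1)$ non zero entries
- $g^*$ can be obtained using the conditional probabilities $P(Y = y \mid X = x)$
Simple case
- $l_b(p, t) = \mathbf{1}_{p \neq t}$
- $g_{l_b}^*(x) = \arg\max_{y \in \mathcal{Y}} P_{(X,Y) \sim D}(Y = y \mid X = x)$
General case
- $g_l^*(x) = \arg\min_{y \in \mathcal{Y}} \sum_{y' \neq y} l(y, y') P_{(X,Y) \sim D}(Y = y' \mid X = x)$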
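
The general decision rule can be sketched in a few lines of Python. The posterior probabilities and the cost table below are made-up illustrations; `bayes_decision` is a hypothetical helper name.

```python
def bayes_decision(posterior, loss):
    """Return the y minimizing sum_{y' != y} l(y, y') P(Y = y' | X = x).

    posterior: dict mapping each class y' to P(Y = y' | X = x)
    loss: function (prediction, truth) -> non-negative cost
    """
    def expected_loss(y):
        return sum(loss(y, t) * prob for t, prob in posterior.items() if t != y)

    return min(posterior, key=expected_loss)


def binary_loss(p, t):
    return 0 if p == t else 1


# Hypothetical asymmetric costs, keyed by (prediction, truth)
COSTS = {("spam", "ham"): 10, ("ham", "spam"): 1}


def asymmetric_loss(p, t):
    return 0 if p == t else COSTS[(p, t)]


posterior = {"ham": 0.3, "spam": 0.7}  # assumed P(Y = y | X = x)
print(bayes_decision(posterior, binary_loss))      # most probable class
print(bayes_decision(posterior, asymmetric_loss))  # cost-aware choice
```

With the binary loss the rule reduces to picking the most probable class. With the asymmetric costs the expected loss of predicting "spam" is 10 x 0.3 = 3 against 1 x 0.7 = 0.7 for "ham", so the less probable class is chosen.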

12 Idea of the proof
- conditional reasoning:
  $R_l(g) = E_{(X,Y) \sim D}\left\{ E_{(X,Y) \sim D}(l(g(X), Y) \mid X = x) \right\}$
- standard properties of the expectation:
  $E_{(X,Y) \sim D}(l(g(X), Y) \mid X = x) = \sum_{y' \in \mathcal{Y}} l(g(x), y') P_{(X,Y) \sim D}(Y = y' \mid X = x)$
- pointwise minimization and $l(y, y) = 0$ give the result

13 Remarks
Discriminant versus Generative
- Bayes rule: $P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y) P(Y = y)}{P(X = x)}$
- for a fixed $x$, one can compare the $P(X = x \mid Y = y) P(Y = y)$ rather than the $P(Y = y \mid X = x)$
- $Y$ given $X$: discriminant, $X$ given $Y$: generative
$\mathcal{Y} = \{y_1, y_2\}$
- $g_l^*(x) = y_1$ if $\frac{l(y_1, y_2) P(Y = y_2 \mid X = x)}{l(y_2, y_1) P(Y = y_1 \mid X = x)} \le 1$, and $g_l^*(x) = y_2$ in the other case
- Ratios of probabilities are sufficient to compute the optimal model
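
The two-class rule depends only on a ratio, so the normalizing term $P(X = x)$ never needs to be computed. A minimal Python sketch, assuming strictly positive costs and a strictly positive score for $y_1$ (the function name and the numbers are illustrative):

```python
def binary_decision(score_y1, score_y2, cost_12, cost_21):
    """Two-class optimal decision from unnormalized scores.

    score_yk: P(X = x | Y = y_k) P(Y = y_k), or any common rescaling of both
    cost_12:  l(y_1, y_2), cost of predicting y_1 when the truth is y_2
    cost_21:  l(y_2, y_1), cost of predicting y_2 when the truth is y_1
    Assumes score_y1 > 0 and cost_21 > 0.
    """
    ratio = (cost_12 * score_y2) / (cost_21 * score_y1)
    return "y1" if ratio <= 1 else "y2"


# Symmetric costs: the larger score wins
print(binary_decision(0.03, 0.07, cost_12=1, cost_21=1))
# Missing y_1 is ten times more costly: the decision flips
print(binary_decision(0.03, 0.07, cost_12=1, cost_21=10))
# Rescaling both scores (e.g. dividing by P(X = x)) changes nothing
print(binary_decision(0.3, 0.7, cost_12=1, cost_21=10))
```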

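15 Generative models
Generative model
- a model that explains both $X$ and $Y$
- as opposed to $Y$ given $X$
- one road to optimal models
Hypotheses
- $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \dots \times \mathcal{X}_P$
- $\mathcal{Y}$ and all the $\mathcal{X}_j$ are finite sets
Probability distributions on $\mathcal{X} \times \mathcal{Y}$
- $|\mathcal{X}_1| \times |\mathcal{X}_2| \times \dots \times |\mathcal{X}_P| \times |\mathcal{Y}| - 1$ parameters
- generally intractable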

16 A simple model
Conditional independence
- explanatory variables are assumed independent given the target variable: $\perp\!\!\!\perp_{1 \le j \le P} X_j \mid Y$
- and thus $P(X = x \mid Y = y) = \prod_{j=1}^{P} P(X_j = x_j \mid Y = y)$
Consequences
- only $\left(1 + \sum_{j=1}^{P} (|\mathcal{X}_j| - 1)\right) |\mathcal{Y}| - 1$ parameters
- easy estimation
- but very strong assumption
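
The gap between the two parameter counts is easy to see on a small example. The sketch below implements both formulas; the choice of ten three-valued variables and two classes is an arbitrary illustration.

```python
from math import prod


def full_joint_params(sizes, n_classes):
    """|X_1| x ... x |X_P| x |Y| - 1 free parameters for an arbitrary joint."""
    return prod(sizes) * n_classes - 1


def naive_bayes_params(sizes, n_classes):
    """(1 + sum_j (|X_j| - 1)) |Y| - 1 free parameters under
    conditional independence."""
    return (1 + sum(s - 1 for s in sizes)) * n_classes - 1


sizes = [3] * 10  # hypothetical: P = 10 variables with 3 values each
print(full_joint_params(sizes, n_classes=2))   # 118097
print(naive_bayes_params(sizes, n_classes=2))  # 41
```

The first count grows exponentially with the number of variables, the second only linearly.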

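17 Naive Bayes
Categorical distribution
- arbitrary distribution on $\mathcal{X}_j = \{u_1^{(j)}, \dots, u_{|\mathcal{X}_j|}^{(j)}\}$
- parameter vector $\Gamma = (\gamma_1, \dots, \gamma_{|\mathcal{X}_j|})$
- $P_{X \sim C(\Gamma)}(X = u_l^{(j)}) = \gamma_l$ and by extension $P_{X \sim C(\Gamma)}(X = u) = \gamma_u$
Naive Bayes distribution
- $\Gamma = (\Gamma_Y, (\Gamma_{j,y})_{1 \le j \le P,\, y \in \mathcal{Y}})$
- distribution on $\mathcal{X} \times \mathcal{Y}$:
  $P_{(X,Y) \sim NB(\Gamma)}(X = x, Y = y) = P_{Y \sim C(\Gamma_Y)}(Y = y) \prod_{j=1}^{P} P_{X_j \mid Y = y \sim C(\Gamma_{j,y})}(X_j = x_j \mid Y = y)$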

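18 Maximum Likelihood Estimation
MLE
- $\hat{\Gamma}_{MLE} = \arg\max_{\Gamma} P(\mathcal{D} \mid \Gamma)$
- equivalently maximizing $\log P(\mathcal{D} \mid \Gamma)$
Naive Bayes log likelihood
- standard i.i.d. assumptions
- separability:
  $\log P(\mathcal{D} \mid \Gamma) = \sum_{i=1}^{N} \log P_{(X,Y) \sim NB(\Gamma)}(X = X_i, Y = Y_i)$
  $= \sum_{i=1}^{N} \sum_{j=1}^{P} \log P_{X_j \mid Y = Y_i \sim C(\Gamma_{j,Y_i})}(X_j = X_{ij} \mid Y = Y_i) + \sum_{i=1}^{N} \log P_{Y \sim C(\Gamma_Y)}(Y = Y_i)$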
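
The separable form of the log likelihood translates directly into two sums. The following Python sketch assumes parameters stored in plain dictionaries (a representation chosen here for illustration) and a tiny made-up data set.

```python
from math import log


def nb_log_likelihood(rows, labels, gamma_y, gamma_x):
    """log P(D | Gamma) for a categorical naive Bayes model.

    gamma_y: dict y -> P(Y = y)
    gamma_x: dict (j, y, x) -> P(X_j = x | Y = y)
    """
    # Sum over i of log P(Y = Y_i)
    class_term = sum(log(gamma_y[y]) for y in labels)
    # Sum over i and j of log P(X_j = X_ij | Y = Y_i)
    feature_term = sum(
        log(gamma_x[(j, y, x)])
        for row, y in zip(rows, labels)
        for j, x in enumerate(row)
    )
    return feature_term + class_term


# Hypothetical data set with N = 3 observations and P = 1 variable
rows = [("a",), ("b",), ("a",)]
labels = [0, 0, 1]
gamma_y = {0: 2 / 3, 1: 1 / 3}
gamma_x = {(0, 0, "a"): 0.5, (0, 0, "b"): 0.5, (0, 1, "a"): 1.0}

print(nb_log_likelihood(rows, labels, gamma_y, gamma_x))
```

Because the two terms involve disjoint blocks of parameters, each block can be maximized on its own.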

19 MLE cont.
Separate estimations
- parameters of the different categorical distributions are independent
- $\hat{\Gamma}_{MLE}$ can be computed block by block:
  $\hat{\Gamma}_{Y,MLE} = \arg\max_{\Gamma_Y} \sum_{i=1}^{N} \log P_{Y \sim C(\Gamma_Y)}(Y = Y_i)$
  $\hat{\Gamma}_{j,y,MLE} = \arg\max_{\Gamma_{j,y}} \sum_{i=1,\, Y_i = y}^{N} \log P_{X_j \mid Y = y \sim C(\Gamma_{j,y})}(X_j = X_{ij} \mid Y = y)$
Consequences
- simple unidimensional estimation
- conditional frequencies

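20 Frequencies
Target variable
- $\forall y \in \mathcal{Y}$, $\hat{\gamma}_{Y,y,MLE} = \frac{|\{i \mid Y_i = y\}|}{N}$
Explanatory variables
- $\forall y \in \mathcal{Y}$, $\forall j$, $\forall x \in \mathcal{X}_j$, $\hat{\gamma}_{j,y,x,MLE} = \frac{|\{i \mid Y_i = y \text{ and } X_{i,j} = x\}|}{|\{i \mid Y_i = y\}|}$
Computational cost
- Very efficient method: $O(NP)$ (with a one pass algorithm)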
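
The frequency estimates can be computed with one pass of counting over the data. Below is a minimal Python sketch; `fit_naive_bayes`, the dictionary layout, and the toy weather data are illustrative assumptions.

```python
from collections import Counter


def fit_naive_bayes(rows, labels):
    """Frequency-based MLE for a categorical naive Bayes model.

    Returns (gamma_y, gamma_x) with
      gamma_y[y]         = |{i : Y_i = y}| / N
      gamma_x[(j, y, x)] = |{i : Y_i = y and X_ij = x}| / |{i : Y_i = y}|
    """
    class_counts = Counter()
    joint_counts = Counter()
    for row, y in zip(rows, labels):      # single pass over the N observations
        class_counts[y] += 1
        for j, x in enumerate(row):       # P variables per observation
            joint_counts[(j, y, x)] += 1
    n = len(labels)
    gamma_y = {y: c / n for y, c in class_counts.items()}
    gamma_x = {
        (j, y, x): c / class_counts[y] for (j, y, x), c in joint_counts.items()
    }
    return gamma_y, gamma_x


# Hypothetical data set: N = 4 observations, P = 2 variables
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "no"]
gamma_y, gamma_x = fit_naive_bayes(rows, labels)
print(gamma_y)
print(gamma_x[(0, "no", "sunny")])
```

Value and class combinations that never occur in the data are simply absent from `gamma_x`, which corresponds to an estimated frequency of 0.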

21 Extensions
Infinite discrete set
- $\mathcal{X}_j = \mathbb{N}$
- replace the categorical distribution by e.g. a Poisson distribution (or any distribution on $\mathbb{N}$)
- frequency based estimation, e.g. if $X_j \mid Y = y$ is a Poisson distribution with parameter $\lambda_{j,y}$ then
  $\hat{\lambda}_{j,y,MLE} = \frac{\sum_{i,\, Y_i = y} X_{i,j}}{|\{i \mid Y_i = y\}|}$
Continuous variables
- $\mathcal{X}_j = \mathbb{R}$
- same principle: choose a parametric distribution on $\mathbb{R}$, e.g. the Gaussian distribution
- block based estimation
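
Block based estimation for one variable amounts to grouping its values by class and fitting each group separately. The Python sketch below covers the Poisson case from the slide and, as an added assumption, the standard Gaussian MLE (per-class mean and population variance), which the slide mentions without giving formulas. Function names and data are illustrative.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def group_by_class(values, labels):
    """Collect the values of one variable X_j separately for each class y."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    return groups


def poisson_mle(values, labels):
    """lambda_{j,y} = mean of X_ij over the observations with Y_i = y."""
    return {y: fmean(vs) for y, vs in group_by_class(values, labels).items()}


def gaussian_mle(values, labels):
    """Per-class (mean, variance); pvariance divides by the class size,
    which is the maximum likelihood estimate of the variance."""
    return {
        y: (fmean(vs), pvariance(vs))
        for y, vs in group_by_class(values, labels).items()
    }


# Hypothetical count variable and continuous variable for four observations
labels = ["a", "a", "b", "b"]
print(poisson_mle([2, 4, 0, 6], labels))
print(gaussian_mle([1.0, 3.0, 10.0, 10.0], labels))
```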

22 Naive Bayes Classifier
Strategy
- estimate the parameters of a NB model $NB(\Gamma)$ for a data set
- approximate the data distribution by the NB distribution, i.e.
  $P_{(X,Y) \sim D}(x, y) \approx P_{(X,Y) \sim NB(\hat{\Gamma}_{MLE})}(x, y)$
- use the approximation to compute the optimal classifier, i.e.
  $g_l^*(x) \approx \arg\min_{y \in \mathcal{Y}} \sum_{y' \neq y} l(y, y') P_{(X,Y) \sim NB(\hat{\Gamma}_{MLE})}(Y = y' \mid X = x)$
- the classifier is optimal only for data distributed exactly according to $NB(\hat{\Gamma}_{MLE})$: this is seldom the case in practice!

23 Naive Bayes Classifier
Example
- as $x$ is fixed when one computes $g_l^*(x)$, only $P_{(X,Y) \sim NB(\hat{\Gamma}_{MLE})}(Y = y, X = x)$ is needed
- we have
  $P_{(X,Y) \sim NB(\hat{\Gamma}_{MLE})}(Y = y, X = x) = P_{Y \sim C(\hat{\Gamma}_{Y,MLE})}(Y = y) \prod_{j=1}^{P} P_{X_j \mid Y = y \sim C(\hat{\Gamma}_{j,y,MLE})}(X_j = x_j \mid Y = y)$
- and thus
  $P_{(X,Y) \sim NB(\hat{\Gamma}_{MLE})}(Y = y, X = x) = \hat{\gamma}_{Y,y,MLE} \prod_{j=1}^{P} \hat{\gamma}_{j,y,x_j,MLE}$
- very simple frequency comparison!
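
For the binary loss, the frequency comparison reduces to picking the class with the largest product of estimated frequencies. The following Python sketch assumes parameters already estimated and stored in dictionaries; the numbers are made up for illustration, and unseen value and class combinations are treated as frequency 0.

```python
def nb_joint(x, y, gamma_y, gamma_x):
    """gamma_{Y,y} * prod_j gamma_{j,y,x_j}, i.e. P(Y = y, X = x) under NB."""
    p = gamma_y[y]
    for j, xj in enumerate(x):
        p *= gamma_x.get((j, y, xj), 0.0)  # unseen combination -> frequency 0
    return p


def nb_classify(x, gamma_y, gamma_x):
    """Optimal class for the binary loss: largest joint probability."""
    return max(gamma_y, key=lambda y: nb_joint(x, y, gamma_y, gamma_x))


# Hypothetical estimated parameters: two classes, two yes/no variables
gamma_y = {"spam": 0.4, "ham": 0.6}
gamma_x = {
    (0, "spam", "yes"): 0.8, (0, "spam", "no"): 0.2,
    (0, "ham", "yes"): 0.1, (0, "ham", "no"): 0.9,
    (1, "spam", "yes"): 0.5, (1, "spam", "no"): 0.5,
    (1, "ham", "yes"): 0.2, (1, "ham", "no"): 0.8,
}

print(nb_classify(("yes", "yes"), gamma_y, gamma_x))
print(nb_classify(("no", "no"), gamma_y, gamma_x))
```

For a general loss, the same joint values can be plugged into the weighted sum over $y' \neq y$ instead of a plain maximum, since the common factor $P(X = x)$ does not change the minimizer.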

24 In practice
Pros and Cons
+ very fast
+ handles mixed data easily
- limited predictive performance compared to state-of-the-art methods
- needs a very good set of explanatory variables
Best practices
- avoid using the NBC when frequencies are very close to 0 or 1
- use a variable selection method

25 Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

26 Version
Last modification:
By: Fabrice Rossi
Git hash: 1b0849e751c9b992d0eb957563be feaa
