Discriminative v. generative

Molly Edwards
5 years ago

1 Discriminative v. generative

2 Naive Bayes

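3 Naive Bayes

Model: $P(x_{ij}, y_i) = \prod_i P(y_i) \prod_j P(x_{ij} \mid y_i)$

Parameters ($2k+1$ in total):
- $P(y_i = +) = p$
- $P(x_{ij} = 1 \mid y_i = -) = a_j$
- $P(x_{ij} = 1 \mid y_i = +) = b_j$

MLE: $\max_{a_j, b_j, p} P(x_{ij}, y_i)$, which gives
- $p = \frac{1}{N} \sum_i [y_i = +]$
- $a_j = \sum_i [(y_i = -) \wedge (x_{ij} = 1)] \,/\, \sum_i [y_i = -]$
- $b_j = \sum_i [(y_i = +) \wedge (x_{ij} = 1)] \,/\, \sum_i [y_i = +]$

Prediction: $P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))$, where $z_i = w_0 + \sum_j w_j x_{ij}$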
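The counting estimates and the logistic form of the posterior can be checked numerically. The following is a minimal sketch, not the lecture's code: the toy data set and the function names are made up for illustration, and the mapping from (p, a_j, b_j) to (w_0, w_j) is the standard log-odds expansion of Bayes' rule for binary features. It assumes no estimate is exactly 0 or 1.

```python
import math

def fit_naive_bayes(X, y):
    """MLE by counting. X: rows of 0/1 features, y: labels in {+1, -1}."""
    k = len(X[0])
    n_pos = sum(1 for yi in y if yi == +1)
    n_neg = len(y) - n_pos
    p = n_pos / len(y)
    a = [sum(x[j] for x, yi in zip(X, y) if yi == -1) / n_neg for j in range(k)]
    b = [sum(x[j] for x, yi in zip(X, y) if yi == +1) / n_pos for j in range(k)]
    return p, a, b

def nb_to_linear(p, a, b):
    """Rewrite the NB posterior log-odds as z = w0 + sum_j w_j x_j."""
    w0 = math.log(p / (1 - p)) + sum(
        math.log((1 - bj) / (1 - aj)) for aj, bj in zip(a, b))
    w = [math.log(bj / aj) - math.log((1 - bj) / (1 - aj))
         for aj, bj in zip(a, b)]
    return w0, w

def posterior_bayes(x, p, a, b):
    """P(y = + | x) computed directly from Bayes' rule."""
    pos, neg = p, 1 - p
    for xj, aj, bj in zip(x, a, b):
        pos *= bj if xj else 1 - bj
        neg *= aj if xj else 1 - aj
    return pos / (pos + neg)

def posterior_logistic(x, w0, w):
    """P(y = + | x) computed as 1 / (1 + exp(-z))."""
    z = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return 1 / (1 + math.exp(-z))

# Hypothetical toy data: 8 examples, k = 2 binary attributes.
X = [[1, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]
y = [+1, +1, +1, +1, -1, -1, -1, -1]

p, a, b = fit_naive_bayes(X, y)
w0, w = nb_to_linear(p, a, b)
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, posterior_bayes(x, p, a, b), posterior_logistic(x, w0, w))
```

The two posterior columns should agree up to rounding, which is the sense in which naive Bayes with binary attributes is a linear classifier with 2k+1 underlying parameters.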

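4 Logistic regression

$P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))$, where $z_i = w_0 + \sum_j w_j x_{ij}$

$\arg\max_w \prod_i P(y_i \mid x_{ij}) = \arg\min_w \sum_i \ln(1 + \exp(-y_i z_i)) = \arg\min_w \sum_i h(y_i z_i)$

Here the labels are coded $y_i \in \{-1, +1\}$ and $h(u) = \ln(1 + \exp(-u))$.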
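The objective $\sum_i \ln(1 + \exp(-y_i z_i))$ can be minimized with plain gradient descent. This is a hedged sketch only: the toy data, the step size, and the iteration count are illustrative assumptions, and the slides do not prescribe an optimizer here.

```python
import math

def logistic_loss(w0, w, X, y):
    """Sum over examples of ln(1 + exp(-y_i z_i)), labels in {+1, -1}."""
    total = 0.0
    for x, yi in zip(X, y):
        z = w0 + sum(wj * xj for wj, xj in zip(w, x))
        total += math.log(1 + math.exp(-yi * z))
    return total

def gradient_step(w0, w, X, y, lr):
    """One step of batch gradient descent on the logistic loss."""
    g0, g = 0.0, [0.0] * len(w)
    for x, yi in zip(X, y):
        z = w0 + sum(wj * xj for wj, xj in zip(w, x))
        # d/dz ln(1 + exp(-y z)) = -y / (1 + exp(y z))
        c = -yi / (1 + math.exp(yi * z))
        g0 += c
        for j, xj in enumerate(x):
            g[j] += c * xj
    return w0 - lr * g0, [wj - lr * gj for wj, gj in zip(w, g)]

# Hypothetical toy data (not linearly separable, so the minimizer is finite).
X = [[1, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]
y = [+1, +1, +1, +1, -1, -1, -1, -1]

w0, w = 0.0, [0.0, 0.0]
initial_loss = logistic_loss(w0, w, X, y)
for _ in range(200):
    w0, w = gradient_step(w0, w, X, y, lr=0.1)
final_loss = logistic_loss(w0, w, X, y)
print(initial_loss, final_loss, w0, w)
```

With all weights at zero every example contributes ln 2, so the starting loss is 8 ln 2; the small step size is chosen so that each step decreases this convex loss.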

5 Same model, different answer

Why? max P(X, Y) vs. max P(Y | X)
- generative vs. discriminative
- MLE v. MCLE (max conditional likelihood estimate)
How to pick? Typically MCLE is better if there is lots of data, MLE better if not.

6 MCLE as MLE

MLE: $\max \prod_i P(x_i, y_i \mid \theta)$
MCLE: $\max \prod_i P(y_i \mid x_i, \theta)$

7 MCLE as MLE

Recipe: MCLE = MLE + extra parameters to decouple P(x) from P(y | x)
Bias-variance tradeoff:
- MLE places additional constraints on θ by coupling to P(x)
- MLE increases bias, decreases variance (vs. MCLE)
To interpolate generative / discriminative models, soft-tie θx to θy w/ prior
Tom Minka. Discriminative models, not discriminative training. MSR tech report TR ,

8 Comparison

As #examples → ∞:
- if Bayes net is right: NB & LR get same answer
- if not: LR has minimum possible training error, and train error → test error, so LR does at least as well as NB, usually better

9 Comparison

Finite sample: n examples with k attributes. How big should n be for excess risk ϵ?
- GNB needs n = θ(log k), as long as at least a constant fraction of attributes are relevant (Hoeffding for each weight + union bound over weights + bound z away from 0)
- LR needs n = θ(k) (VC-dimension of linear classifier)
- GNB converges much faster to its (perhaps less accurate) final estimates
see [Ng & Jordan, 2002]

10 Comparison on UCI

[Figure: error vs. m. Panels: voting records (discrete), pima (continuous). NB: solid; LR: dashed.]
see [Ng & Jordan, 2002]

11 Comparison on UCI

[Figure: error vs. m. Panels: lymphography (discrete), optdigits (0's and 1's, continuous), breast cancer (discrete), sick (discrete), voting records (discrete). NB: solid; LR: dashed.]
see [Ng & Jordan, 2002]

12 Decision trees

13 Dichotomous classifier

1. a. Insect has 1 pair of wings ... Order Diptera (flies, mosquitoes)
   b. Insect has 2 pairs of wings ... go to #2
2. a. Insect has extremely long prothorax (neck) ... go to #3
   b. Insect has a regular length or no prothorax ... go to #4
3. a. Forelegs come together in a 'praying' position ... Order Mantodea (mantids)
   b. Forelegs do not come together in a 'praying' position ... Order Raphidioptera (snakeflies)
4. a. Wings are armour-like with membranous hindwings underneath them ... Order Coleoptera (beetles)
   b. Wings are not armour-like ... go to #5
5. a. Wings twist when insect is in flight ... Order Strepsiptera (twisted-wing parasites)
   b. Wings flap up and down (no twisting) when in flight ... go to #6
6. a. Wings are triangular in shape ... go to #7
   b. Wings are not triangular in shape ... go to #8

14 Decision tree

Problem: classification (or regression)
- n training examples (x1, y1), (x2, y2), …, (xn, yn)
- $x_i \in \mathbb{R}^k$, $y_i \in \{0, 1\}$
- well-known implementations: ID3, C4.5, J48, CART

15 The picture

16 The picture Composition II in Red, Blue, and Yellow Piet Mondrian,

17 Variants

- Type of question at internal nodes
- Type of label at leaf nodes
- Labels on internal nodes or edges

18 Variants

- Decision list
- Decision diagram (DAG)

20 Example

[Figure: axes are Sepal Length and Petal Length]

21 Representational power

[Figure: AND, OR, XOR]

22 Why decision trees?

Why?
- flexible hypothesis class
- work pretty well
- fairly interpretable
- very fast at test time
- closed under common operations
Why not DTs?
- learning is NP-hard
- often not state-of-art error rate (but: see bagging, boosting)

23 Learning

red?  fuzzy?  Class
T     T       -
T     F       +
T     F       -
F     T       -
F     F       +

24 Learning

Bigger data sets with more attributes: finding the training-set MLE is NP-hard.
Heuristic search: build the tree greedily, from the root down:
- start with all training examples in one bin
- pick an impure bin
- try some candidate splits (e.g., all single-attribute binary tests) and pick the best (largest increase in likelihood)
- repeat until all bins are either pure or have no possible splits left (ran out of attributes to split on)
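The greedy procedure can be sketched in a few lines. This is an illustrative sketch rather than ID3 or C4.5 as published: the tuple-based tree representation and the function names are made up here, each attribute is used at most once per path, and the five-example red/fuzzy data set assumes that the class labels dropped from the extracted table were "-".

```python
import math
from collections import Counter

def log_likelihood(labels):
    """Training log-likelihood of one bin under its empirical class frequencies."""
    n = len(labels)
    return sum(c * math.log(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Greedy top-down construction; stops when a bin is pure or attrs run out."""
    if len(set(labels)) <= 1 or not attrs:
        return ("leaf", Counter(labels))

    def split_ll(attr):
        bins = {}
        for row, lab in zip(rows, labels):
            bins.setdefault(row[attr], []).append(lab)
        return sum(log_likelihood(b) for b in bins.values())

    # Best split = largest likelihood after splitting.
    best = max(attrs, key=split_ll)
    remaining = [a for a in attrs if a != best]
    groups = {}
    for row, lab in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(lab)
    children = {v: build_tree(r, l, remaining) for v, (r, l) in groups.items()}
    return ("split", best, children)

# Assumed reconstruction of the red?/fuzzy? example.
rows = [
    {"red": True, "fuzzy": True},
    {"red": True, "fuzzy": False},
    {"red": True, "fuzzy": False},
    {"red": False, "fuzzy": True},
    {"red": False, "fuzzy": False},
]
labels = ["-", "+", "-", "-", "+"]

tree = build_tree(rows, labels, ["red", "fuzzy"])
print(tree)
```

On this data the root split is on fuzzy, because it leaves one pure bin; the remaining impure bin is then split on red and ends with one leaf that cannot be purified.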

25 Information gain

red?  fuzzy?  Class
T     T       -
T     F       +
T     F       -
F     T       -
F     F       +

Initially: L = 2 log(.4) + 3 log(.6)
Split on red:
- bin T: 2 log(.667) + log(.333)
- bin F: 2 log(.5)
Split on fuzzy:
- bin T: 2 log(1) + 0 log(0) = 0
- bin F: 2 log(.667) + log(.333)
In general: gain = $H(Y) - E_X[H(Y \mid X)]$
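The link between "increase in log-likelihood" and information gain can be checked on this example: with n examples, the likelihood increase from a split equals n times $H(Y) - E_X[H(Y \mid X)]$ (natural logs). The sketch below assumes the labels missing from the extracted table are "-"; the helper names are illustrative.

```python
import math
from collections import Counter

# (red?, fuzzy?, class); labels lost in extraction are assumed to be "-".
data = [
    (True, True, "-"),
    (True, False, "+"),
    (True, False, "-"),
    (False, True, "-"),
    (False, False, "+"),
]

def log_likelihood(labels):
    """Sum over classes of n_c * log(n_c / n)."""
    n = len(labels)
    return sum(c * math.log(c / n) for c in Counter(labels).values())

def entropy(labels):
    """Empirical entropy H(Y) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def split(data, col):
    """Group class labels by the value of attribute column `col`."""
    bins = {}
    for row in data:
        bins.setdefault(row[col], []).append(row[-1])
    return bins

def split_log_likelihood(data, col):
    return sum(log_likelihood(b) for b in split(data, col).values())

def information_gain(data, col):
    labels = [row[-1] for row in data]
    n = len(labels)
    cond = sum(len(b) / n * entropy(b) for b in split(data, col).values())
    return entropy(labels) - cond

labels = [row[-1] for row in data]
ll_init = log_likelihood(labels)
for name, col in [("red", 0), ("fuzzy", 1)]:
    gain_ll = split_log_likelihood(data, col) - ll_init
    print(name, gain_ll, len(data) * information_gain(data, col))
```

The two printed quantities per attribute should match, and fuzzy should show the larger gain.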

26 Real-valued attributes

27 Multi-way discrete splits

[Table with columns SS#, Temp, Sick?]
- Split on temp yields {-, -} and {+, -, +}
- Split on SS# yields 5 pure leaves

28 Pruning

- Build tree on training set
- Prune on holdout set:
  - while removing the last split along some path improves holdout error, do so
  - if a node N's children are all pruned, then N becomes eligible for pruning
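A minimal sketch of this holdout-based pruning loop follows. The dict-based tree format (keys "attr", "children", "majority", "label"), the hand-built tree, and the holdout set are all hypothetical and chosen only to illustrate the two rules; a real implementation would take the tree from the learner.

```python
def predict(tree, row):
    """Follow splits until a leaf (a dict without an 'attr' key) is reached."""
    while "attr" in tree:
        tree = tree["children"][row[tree["attr"]]]
    return tree["label"]

def errors(tree, rows, labels):
    return sum(predict(tree, r) != l for r, l in zip(rows, labels))

def prunable_nodes(tree):
    """Yield internal nodes whose children are all leaves (the 'last splits')."""
    if "attr" not in tree:
        return
    kids = list(tree["children"].values())
    if all("attr" not in k for k in kids):
        yield tree
    else:
        for k in kids:
            yield from prunable_nodes(k)

def prune(root, rows, labels):
    """Collapse last splits while doing so strictly improves holdout error."""
    changed = True
    while changed:
        changed = False
        for node in list(prunable_nodes(root)):
            before = errors(root, rows, labels)
            saved = dict(node)
            node.clear()
            node["label"] = saved["majority"]   # tentatively make it a leaf
            if errors(root, rows, labels) < before:
                changed = True                  # keep the pruned version
            else:
                node.clear()
                node.update(saved)              # undo
    return root

# Hypothetical tree and holdout set.
tree = {
    "attr": "fuzzy", "majority": "-",
    "children": {
        True: {"label": "-"},
        False: {
            "attr": "red", "majority": "+",
            "children": {True: {"label": "-"}, False: {"label": "+"}},
        },
    },
}
holdout_rows = [
    {"red": True, "fuzzy": False},
    {"red": True, "fuzzy": False},
    {"red": False, "fuzzy": True},
]
holdout_labels = ["+", "+", "-"]

prune(tree, holdout_rows, holdout_labels)
print(tree, errors(tree, holdout_rows, holdout_labels))
```

Here the split on red is removed because it hurts on the holdout set; the root then becomes eligible, but collapsing it would raise the holdout error, so it stays.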

29 Prune as rules

Alternately, convert each leaf to a rule, then prune:
- a rule is the conjunction of the tests on the path to the leaf, e.g. test1 ∧ test2 ∧ test3
- while dropping a test from a rule improves performance, do so

30 Bagging

Bagging = bootstrap aggregating
- Can be used with any classifier, but particularly effective with decision trees
- Generate M bootstrap resamples
- Train a decision tree on each one
- Final classifier: vote all M trees (e.g., tree 1 says p(+) = .7, tree 2 says p(+) = .9: predict .8)
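The recipe can be sketched with a one-split tree (a decision stump) standing in for a full decision tree. Everything below is an illustrative assumption: the stump learner, the 1-D toy data, M = 25, and the fixed seed are not from the slides.

```python
import random

def fit_stump(xs, ys):
    """Depth-1 tree on one real feature. ys are 0/1. Returns (threshold, p_left, p_right)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        p_left = sum(left) / len(left)            # left is never empty: t is in xs
        p_right = sum(right) / len(right) if right else p_left
        err = (sum((y - p_left) ** 2 for y in left)
               + sum((y - p_right) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, p_left, p_right)
    return best[1:]

def predict_stump(stump, x):
    t, p_left, p_right = stump
    return p_left if x <= t else p_right

def bag(xs, ys, M, seed=0):
    """Train one stump on each of M bootstrap resamples (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(M):
        idx = rng.choices(range(n), k=n)
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """Average p(+) over all models, as in the .7 / .9 -> .8 example."""
    return sum(predict_stump(m, x) for m in models) / len(models)

# Hypothetical noisy 1-D data.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

models = bag(xs, ys, M=25)
print(predict_bagged(models, 0.05), predict_bagged(models, 0.95))
```

Averaging probabilities rather than hard votes matches the slide's example; a majority vote over thresholded predictions would be an equally reasonable variant.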

31 Out-of-bag error estimates

- Each bag contains a fraction (1 − 1/e) (~63%) of the examples
- Use out-of-bag examples to estimate error of each tree
- To estimate error of overall vote: for each example, classify using all out-of-bag trees; average across all examples
- Conservative: we're averaging over only ~37% of our trees, but if we have lots of trees, bias is small
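A quick simulation can confirm the in-bag fraction: a bootstrap resample of n examples, drawn with replacement, contains about 1 − (1 − 1/n)^n of the distinct examples, which tends to 1 − 1/e ≈ 0.632, so each example is out of bag for roughly 1/e ≈ 37% of the trees. The values of n, the number of trials, and the seed below are arbitrary choices for this sketch.

```python
import math
import random

def distinct_fraction(n, rng):
    """Fraction of the n examples that appear at least once in one bootstrap resample."""
    idx = [rng.randrange(n) for _ in range(n)]
    return len(set(idx)) / n

rng = random.Random(0)
n, trials = 1000, 200
avg_in_bag = sum(distinct_fraction(n, rng) for _ in range(trials)) / trials
exact = 1 - (1 - 1 / n) ** n

print("simulated in-bag fraction:", avg_in_bag)
print("exact for n = 1000:       ", exact)
print("limit 1 - 1/e:            ", 1 - 1 / math.e)
```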

32 Boosting

33 Voted classifiers

$f: \mathbb{R}^k \to \{-1, 1\}$
- Voted classifier: $\sum_j f_j(x) > 0$
- Weighted vote: $\sum_j \alpha_j f_j(x) > 0$
  - assume wlog $\alpha_j > 0$
  - optionally scale so the $\alpha_j$ sum to 1
[Figure: 5 halfspaces (or add constant classifier for |H| = 6)]
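A weighted vote over halfspace classifiers can be written directly from the formula. In this sketch the three halfspaces, the weights, and the helper names are hypothetical; it also shows that rescaling the weights to sum to 1 leaves the prediction unchanged, and that weights can flip the outcome of a plain vote.

```python
def halfspace(w, b):
    """Return f(x) = +1 if w.x + b > 0 else -1."""
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
    return f

def weighted_vote(classifiers, alphas, x):
    """Predict +1 when sum_j alpha_j f_j(x) > 0, else -1."""
    score = sum(a * f(x) for f, a in zip(classifiers, alphas))
    return 1 if score > 0 else -1

# Three hypothetical halfspaces in R^2.
classifiers = [
    halfspace([1.0, 0.0], 0.0),    # x0 > 0
    halfspace([0.0, 1.0], 0.0),    # x1 > 0
    halfspace([1.0, 1.0], -1.0),   # x0 + x1 > 1
]

x = (1.0, -1.0)                    # individual votes: +1, -1, -1
plain = weighted_vote(classifiers, [1, 1, 1], x)
weighted = weighted_vote(classifiers, [3, 1, 1], x)
scaled = weighted_vote(classifiers, [0.6, 0.2, 0.2], x)  # same weights, sum to 1
print(plain, weighted, scaled)
```

A negative weight on some f_j is equivalent to a positive weight on the classifier −f_j, which is why positive weights can be assumed without loss of generality when the hypothesis class is closed under negation.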

34 Extending the hypothesis space

[Figure from Hastie, Tibshirani, Friedman (2nd ed)]

35 Voted classifiers: the matrix

[Figure: matrix indexed by T distinct classifiers ($T < 2^n$) and n training examples]

36 Finding the best voted classifier
