Discriminative v. generative
Naive Bayes
Naive Bayes

Model: P(x_i, y_i) = ∏_i P(y_i) ∏_j P(x_ij | y_i)

Parameters (2k + 1 of them):
P(y_i = +) = p
P(x_ij = 1 | y_i = -) = a_j
P(x_ij = 1 | y_i = +) = b_j

MLE: max over a_j, b_j, p of ∏ P(x_i, y_i), which gives counting estimates:
p = (1/N) #[y_i = +]
a_j = #[(y_i = -) ∧ (x_ij = 1)] / #[y_i = -]
b_j = #[(y_i = +) ∧ (x_ij = 1)] / #[y_i = +]

Posterior: P(y_i = + | x_i) = 1/(1 + exp(-z_i)), where z_i = w_0 + Σ_j w_j x_ij
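The counting estimates above can be sketched directly; a minimal version assuming binary features and labels in {-1, +1} (the function and variable names `nb_mle`, `a`, `b`, `p` are mine, not from the slides):

```python
def nb_mle(X, y):
    """MLE for binary naive Bayes by counting.
    X: list of binary feature vectors; y: list of labels in {-1, +1}.
    Returns (p, a, b): p = P(y=+), a[j] = P(x_j=1 | y=-), b[j] = P(x_j=1 | y=+)."""
    n, k = len(X), len(X[0])
    pos = [x for x, yi in zip(X, y) if yi == +1]
    neg = [x for x, yi in zip(X, y) if yi == -1]
    p = len(pos) / n                                          # fraction of + examples
    a = [sum(x[j] for x in neg) / len(neg) for j in range(k)]  # P(x_j=1 | y=-)
    b = [sum(x[j] for x in pos) / len(pos) for j in range(k)]  # P(x_j=1 | y=+)
    return p, a, b
```

With these estimates plugged in, the posterior P(y = + | x) takes the logistic form shown on the slide.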
Logistic regression

P(y_i = + | x_i) = 1/(1 + exp(-z_i)), where z_i = w_0 + Σ_j w_j x_ij

arg max_w ∏_i P(y_i | x_i)
  = arg min_w Σ_i ln(1 + exp(-y_i z_i))
  = arg min_w Σ_i h(y_i z_i)
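The objective being minimized can be written out in a few lines; a sketch assuming labels in {-1, +1} (`logistic_nll` is my name, not the slides'):

```python
import math

def logistic_nll(w0, w, X, y):
    """Negative conditional log-likelihood: sum_i ln(1 + exp(-y_i z_i)),
    where z_i = w0 + sum_j w[j] * x_i[j] and y_i in {-1, +1}."""
    total = 0.0
    for x, yi in zip(X, y):
        z = w0 + sum(wj * xj for wj, xj in zip(w, x))
        total += math.log(1.0 + math.exp(-yi * z))
    return total
```

Minimizing this over w (e.g., by gradient descent) is the MCLE; at w = 0 every example contributes ln 2, and weights pointing toward a separating direction lower the loss.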
Same model, different answer

Why? max P(X, Y) vs. max P(Y | X): generative vs. discriminative, MLE vs. MCLE (maximum conditional likelihood estimate).
How to pick? Typically MCLE is better with lots of data, MLE better with less.
MCLE as MLE

max_θ ∏_i P(x_i, y_i | θ)  vs.  max_θ ∏_i P(y_i | x_i, θ)
MCLE as MLE

Recipe: MCLE = MLE + extra parameters to decouple P(x) from P(y | x).
Bias-variance tradeoff: MLE places additional constraints on θ by coupling it to P(x), so MLE increases bias and decreases variance (vs. MCLE).
To interpolate between generative and discriminative models, soft-tie θ_x to θ_y with a prior.
Tom Minka. Discriminative models, not discriminative training. MSR tech report TR-2005-144, 2005.
Comparison

As #examples → ∞:
- if the Bayes net structure is right: NB and LR get the same answer
- if not: LR achieves the minimum possible training error, and train error → test error, so LR does at least as well as NB, usually better
Comparison

Finite sample: n examples with k attributes; how big should n be for excess risk ε?
GNB needs n = Θ(log k), as long as at least a constant fraction of the attributes are relevant (Hoeffding bound for each weight + union bound over weights + bound z away from 0).
LR needs n = Θ(k) (VC dimension of a linear classifier).
So GNB converges much faster to its (perhaps less accurate) final estimates.
See [Ng & Jordan, 2002].
Comparison on UCI

[Figure: test error vs. training-set size m on voting records (discrete) and pima (continuous); NB: solid, LR: dashed]
See [Ng & Jordan, 2002].
Comparison on UCI

[Figure: test error vs. training-set size m on lymphography (discrete), breast cancer (discrete), optdigits (0s and 1s, continuous), sick (discrete), and voting records (discrete); NB: solid, LR: dashed]
See [Ng & Jordan, 2002].
Decision trees
Dichotomous classifier

1. a. Insect has 1 pair of wings... Order Diptera (flies, mosquitoes)
   b. Insect has 2 pairs of wings... go to #2
2. a. Insect has extremely long prothorax (neck)... go to #3
   b. Insect has a regular length or no prothorax... go to #4
3. a. Forelegs come together in a 'praying' position... Order Mantodea (mantids)
   b. Forelegs do not come together in a 'praying' position... Order Raphidioptera (snakeflies)
4. a. Wings are armour-like with membranous hindwings underneath them... Order Coleoptera (beetles)
   b. Wings are not armour-like... go to #5
5. a. Wings twist when insect is in flight... Order Strepsiptera (twisted-wing parasites)
   b. Wings flap up and down (no twisting) when in flight... go to #6
6. a. Wings are triangular in shape... go to #7
   b. Wings are not triangular in shape... go to #8

http://www.insectidentification.org/winged-insect-key.asp
Decision tree

Problem: classification (or regression)
n training examples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n); x_i ∈ R^k, y_i ∈ {0, 1}
Well-known implementations: ID3, C4.5, J48, CART
The picture

[Figure: Composition II in Red, Blue, and Yellow, Piet Mondrian, 1930: an axis-aligned rectangular partition of the plane, the kind of picture a decision tree's splits produce]
Variants

- Type of question at internal nodes
- Type of label at leaf nodes
- Labels on internal nodes or edges
Variants

- Decision list
- Decision diagram (DAG)
Example

[Figure: decision-tree partition of a 2-D feature space, Sepal Length vs. Petal Length]
Representational power

AND, OR, XOR
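AND and OR each need only one path of splits, but XOR is the interesting case: no single split separates it, while a depth-2 tree computes it exactly. A hand-built sketch (the function name is mine):

```python
def xor_tree(x1, x2):
    """Depth-2 decision tree computing XOR of two binary attributes."""
    if x1:                       # root split on x1
        return 0 if x2 else 1    # left subtree splits on x2
    else:
        return 1 if x2 else 0    # right subtree splits on x2
```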
Why decision trees?

Why?
- flexible hypothesis class
- work pretty well
- fairly interpretable
- very fast at test time
- closed under common operations

Why not DTs?
- learning is NP-hard
- often not state-of-the-art error rate (but: see bagging, boosting)
Learning

red?  fuzzy?  Class
T     T       -
T     F       +
T     F       -
F     T       -
F     F       +
Learning

Bigger data sets with more attributes: finding the training-set MLE is NP-hard.
Heuristic search: build the tree greedily, root down:
- start with all training examples in one bin
- pick an impure bin
- try some candidate splits (e.g., all single-attribute binary tests), pick the best (largest increase in likelihood)
- repeat until all bins are either pure or have no possible splits left (ran out of attributes to split on)
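The greedy loop above can be sketched as a short recursion. This version scores candidate splits by information gain, a standard stand-in for the slide's "largest increase in likelihood"; all function and variable names are my own:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Greedy top-down induction: return a majority-label leaf when the bin
    is pure or no attributes remain; otherwise split on the attribute with
    the largest information gain and recurse on each value's sub-bin."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    def gain(a):
        bins = {}
        for r, l in zip(rows, labels):
            bins.setdefault(r[a], []).append(l)
        return entropy(labels) - sum(
            len(ls) / len(labels) * entropy(ls) for ls in bins.values())
    best = max(attrs, key=gain)
    return (best, {v: build_tree([r for r, l in zip(rows, labels) if r[best] == v],
                                 [l for r, l in zip(rows, labels) if r[best] == v],
                                 [a for a in attrs if a != best])
                   for v in set(r[best] for r in rows)})
```

On the red?/fuzzy? training set from these slides, the root split chosen is fuzzy?, whose T bin is pure.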
Information gain

Same training set as before:

red?  fuzzy?  Class
T     T       -
T     F       +
T     F       -
F     T       -
F     F       +

Initially: L = 2 log(.4) + 3 log(.6)
Split on red?: bin T: 2 log(.667) + log(.333); bin F: 2 log(.5)
Split on fuzzy?: bin T: 2 log 1 + 0 log 0 = 0; bin F: 2 log(.667) + log(.333)
In general: pick the split maximizing H(Y) - E_X[H(Y | X)]
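The arithmetic on this slide checks out (natural logs); splitting on fuzzy? gives the larger likelihood because one of its bins is pure:

```python
import math
log = math.log

# initial bin: 2 positives, 3 negatives
L0 = 2 * log(0.4) + 3 * log(0.6)
# split on red?: bin T = {-, +, -}, bin F = {-, +}
L_red = (2 * log(2/3) + log(1/3)) + 2 * log(0.5)
# split on fuzzy?: bin T = {-, -} is pure (log-likelihood 0), bin F = {+, -, +}
L_fuzzy = 0.0 + (2 * log(2/3) + log(1/3))
```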
Real-valued attributes
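The slide does not spell the procedure out in text, but the standard trick is: sort the attribute's values and consider a binary split at each midpoint between consecutive distinct values. A sketch scoring splits by misclassification count (an illustrative choice; entropy or likelihood would slot in the same way, and the names are mine):

```python
from collections import Counter

def best_threshold(values, labels):
    """Return (threshold, errors) for the best split 'value < t' on one
    real-valued attribute, trying midpoints between distinct sorted values."""
    pairs = sorted(zip(values, labels))
    def errs(bin_labels):  # misclassifications if the bin predicts its majority
        return len(bin_labels) - max(Counter(bin_labels).values()) if bin_labels else 0
    best_t, best_e = None, len(labels) + 1
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        e = errs([l for v, l in pairs if v < t]) + errs([l for v, l in pairs if v >= t])
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e
```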
Multi-way discrete splits

SS#          Temp  Sick?
123-45-6789  36    -
010-10-1010  36.5  +
555-55-1212  41    +
314-15-9265  37    -
271-82-8183  40    +

Split on temp (say, at 37.5) yields {-, +, -} and {+, +}: one bin still impure.
Split on SS# yields 5 pure leaves: perfect on the training set, but useless for generalization.
Pruning

Build the tree on the training set.
Prune on a holdout set: while removing the last split along some path improves holdout error, do so.
If a node N's children are all pruned, then N becomes eligible for pruning.
Prune as rules

Alternately, convert each root-to-leaf path to a rule: test1 ∧ test2 ∧ test3 → label.
Then prune: while dropping a test from a rule improves performance, do so.
Bagging

Bagging = bootstrap aggregating.
Can be used with any classifier, but particularly effective with decision trees.
- Generate M bootstrap resamples
- Train a decision tree on each one
- Final classifier: vote all M trees
E.g., tree 1 says p(+) = .7, tree 2 says p(+) = .9: predict .8.
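A minimal sketch of the recipe above, with a pluggable learner (`train_tree`) standing in for actual decision-tree training; the function names are mine:

```python
import random

def bagged_predict(X, y, x, train_tree, M=25, seed=0):
    """Bagging: draw M bootstrap resamples (n draws with replacement),
    train one model per resample, and average their predicted p(+)."""
    rng = random.Random(seed)
    n = len(X)
    preds = []
    for _ in range(M):
        idx = [rng.randrange(n) for _ in range(n)]          # resample with replacement
        model = train_tree([X[i] for i in idx], [y[i] for i in idx])
        preds.append(model(x))
    return sum(preds) / M  # e.g., trees at .7 and .9 average to .8
```

Any learner that returns a callable predictor can be dropped in for `train_tree`.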
Out-of-bag error estimates

Each bag contains ~63% (1 - 1/e) of the distinct examples.
Use out-of-bag examples to estimate the error of each tree.
To estimate the error of the overall vote:
- for each example, classify using all trees for which it is out of bag
- average across all examples
Conservative: each example's vote uses only the ~37% of trees that left it out of bag, but if we have lots of trees, the bias is small.
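The ~63% figure is 1 - 1/e, the large-n limit of 1 - (1 - 1/n)^n, and is easy to check by simulation (function name mine):

```python
import random

def bag_coverage(n, trials=200, seed=0):
    """Average fraction of distinct examples appearing in a size-n
    bootstrap resample; approaches 1 - 1/e ≈ 0.632 as n grows."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += len({rng.randrange(n) for _ in range(n)}) / n
    return total / trials
```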
Boosting
Voted classifiers

f: R^k → {-1, +1}
Voted classifier: predict + iff Σ_j f_j(x) > 0
Weighted vote: Σ_j α_j f_j(x) > 0
- assume WLOG α_j > 0
- optionally scale so the α_j sum to 1

[Figure: 5 halfspaces (or add the constant classifier for |H| = 6)]
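The weighted vote above, as a short sketch (names mine):

```python
def weighted_vote(fs, alphas, x):
    """Predict sign(sum_j alpha_j * f_j(x)) with f_j: R^k -> {-1, +1}
    and alpha_j > 0 (WLOG: a negative weight just flips f_j's sign)."""
    s = sum(a * f(x) for f, a in zip(fs, alphas))
    return 1 if s > 0 else -1
```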
Extending the hypothesis space

[Figure: from Hastie, Tibshirani, Friedman (2nd ed.)]
Voted classifiers: the matrix

T distinct classifiers (T < 2^n) × n training examples
Finding the best voted classifier