Feature selection. Micha Elsner. January 29, 2014


Using megam as max-ent learner
- Hal Daume III from UMD wrote a max-ent learner
- Pretty typical of many classifiers out there...
- Step one: create a text file with tags and features, one item per line, in the form <TAG> <feature> <feature> <feature>:

    IN word-in title-case
    DT word-an
    NNP has-punc title-case word-oct. mixed-case
    CD numeric word-19

- Features are binary; only write down those with value 1
- Step two: run megam (multitron selects stochastic gradient training):

    megam_i686.opt -nc -pa multitron train.txt > classifier.cls

- Output in classifier.cls gives Θ values for each feature/class, one feature per line (for example, a line for word-an), in the form:

    <FEATURE> <theta_t1> <theta_t2>...
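
Step one above can be sketched in Python. This is only an illustration of the file layout shown on the slide: the helper names `token_features`, `megam_line` and `write_megam_file` are hypothetical, and the feature set is a small assumed subset (it omits mixed-case, for instance).

```python
import string

def token_features(word):
    """Return the names of the binary features that are 1 for this word.

    Hypothetical extractor: only a few of the slide's features are covered.
    """
    feats = ["word-" + word.lower()]
    if word.istitle():
        feats.append("title-case")
    if word.isdigit():
        feats.append("numeric")
    if any(ch in string.punctuation for ch in word):
        feats.append("has-punc")
    return feats

def megam_line(tag, active_features):
    """Format one training item as '<TAG> <feature> <feature> ...'."""
    return " ".join([tag] + list(active_features))

def write_megam_file(path, tagged_words):
    """Write (word, tag) pairs to a training file, one item per line."""
    with open(path, "w", encoding="utf-8") as out:
        for word, tag in tagged_words:
            out.write(megam_line(tag, token_features(word)) + "\n")
```

Calling `write_megam_file("train.txt", [("In", "IN"), ("an", "DT")])` would produce a file whose first two lines match the slide's first two example lines.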

From last lecture: Using an optimizer
- Why did we do the whole calculus thing?
- For vanilla max-entropy, you don't need to do any calculus: use megam or whatever
- For more complex max-ent-like models, you need to write your own log-likelihood and derivative functions and pass them to an optimizer package
- For instance, the scipy.optimize library provides:

    fmin_l_bfgs_b(func, x0, fprime = None,...)
        Minimize a function func using the L-BFGS-B algorithm.
        Arguments:
        func   -- function to minimize. Called as func(x, *args)
        x0     -- initial guess to minimum
        fprime -- gradient of func.... Called as fprime(x, *args)
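
A minimal sketch of the calling pattern, assuming NumPy and SciPy are installed. The objective here is a toy quadratic standing in for a negative log-likelihood (the optimizer minimizes, so a log-likelihood would be negated before being passed in); the names `neg_objective` and `neg_objective_grad` are made up for the example.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def neg_objective(x):
    # Toy stand-in for a negative log-likelihood: a bowl with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def neg_objective_grad(x):
    # Gradient of the toy objective, written by hand as one would for a real model.
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)])

# x0 is the initial guess; the call returns the best point, its value, and a status dict.
x_best, f_best, info = fmin_l_bfgs_b(neg_objective, np.zeros(2), fprime=neg_objective_grad)
```

For a real model, `x` would hold the θ values and both functions would loop over the training data.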

Where do features come from?
- We've seen some example classification projects
- Features:
  - Throw in all the words (or all the bigrams)
  - POS tags
  - Information from syntax trees
  - Semantics from lexical resources (Wordnet, sentiment, etc)
  - Dialogue act tags, discourse relations, etc
  - Prosody
  - Morphology/spelling features
- Of course, your task might require different things

Feature interactions
- Maximum entropy is a linear model: the contribution of different features is additive
- Can't learn a superadditive effect: X and Y together are way better evidence than X + Y
- Or an xor effect: X or Y but not both
- Can improve this by manually adding interactions: $f_{ij} = f_i f_j$
- Many possible interactions...
- In the end, you have many, many features
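
For binary features, the product $f_i f_j$ is 1 exactly when both features are active, so an interaction is just a new feature named after the pair. A small sketch, with the hypothetical helper `add_pairwise_interactions` and an assumed `&` naming scheme:

```python
from itertools import combinations

def add_pairwise_interactions(active_features):
    """Return the active features plus one conjunction feature per active pair.

    Because features are binary, f_i * f_j = 1 only when both are active,
    so only pairs of active features need to be written down.
    """
    base = sorted(set(active_features))
    pairs = [f"{a}&{b}" for a, b in combinations(base, 2)]
    return base + pairs
```

With n active features this adds n(n-1)/2 conjunctions per item, which is how the feature count grows so quickly.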

Is more always better?
- Adding features can make things worse
- Naive Bayes: correlated features throw things off
- Maximum entropy (and similar):
  - Optimizer efficiency decreases
  - Overfitting increases
  - Rule of thumb is about 10 training items per parameter
- So the program is slow and also not very good
- The curse of dimensionality:
  - As the number of dimensions increases, the space gets larger...
  - A k-dimensional hypercube with side d has volume $d^k$
  - Fewer samples in each region of space
  - Harder to guess what will happen in that region
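
One way to make the $d^k$ growth concrete, under the assumption that each of the k axes is cut into d equal bins, so the space has $d^k$ cells. The function name is hypothetical.

```python
def average_samples_per_cell(n_samples, bins_per_axis, n_dims):
    """Average number of samples per cell when every axis is split into equal bins."""
    n_cells = bins_per_axis ** n_dims  # d ** k cells in a k-dimensional grid
    return n_samples / n_cells

# The same 1000 samples spread over 1, 2 and 3 dimensions with 10 bins per axis.
densities = [average_samples_per_cell(1000, 10, k) for k in (1, 2, 3)]
```

The sample count stays fixed while the cell count is multiplied by d with every added dimension, so each region quickly ends up with too few examples to estimate anything.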

Feature selection
- Smoothing (regularization): sometimes fancy smoothing methods that push lots of weights to 0
- External quality metrics: measure how good each feature is on its own
- Stepwise selection: add or remove features one at a time
- Making up metafeatures: dimensionality reduction, clustering

Smoothing
- Why this could help: control overfitting
- Basically the same issue as for LMs: estimates match training data better than they match the real world
- More features, better potential match to training data, more overfitting

Smoothing for the maximum entropy model

$$P(T = t \mid F) = \frac{\exp(\theta_t \cdot F)}{\sum_{t'} \exp(\theta_{t'} \cdot F)}$$

- Overfitting means making $P(T \mid F)$ really large for the training set
- Do this by making various θ larger or smaller
- Smoothing: control the size of θ
- Bayesian interpretation: put a prior on θ
- Our initial belief about θ: not that large

$$P(\Theta \mid D) \propto P(D \mid \Theta) P(\Theta)$$

- $P(D \mid \Theta)$ is the likelihood, $P(\Theta)$ is our prior
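
The model formula can be evaluated directly for binary features, where $\theta_t \cdot F$ is just the sum of the weights of the active features. A sketch with made-up weights; the function name and the `theta` values are illustrative assumptions, not output from any trained model.

```python
import math

def tag_probabilities(theta, active_features):
    """Compute P(T = t | F) for every tag t under a max-ent model.

    theta maps tag -> {feature: weight}; missing weights count as 0.
    """
    scores = {
        tag: sum(weights.get(f, 0.0) for f in active_features)
        for tag, weights in theta.items()
    }
    top = max(scores.values())  # subtract the max so exp() cannot overflow
    exps = {tag: math.exp(s - top) for tag, s in scores.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}

# Hypothetical weights for two tags.
theta = {
    "NNP": {"title-case": 2.0, "word-an": -1.0},
    "DT": {"word-an": 3.0},
}
probs = tag_probabilities(theta, ["title-case"])
```

Making the title-case weight for NNP larger pushes `probs["NNP"]` toward 1, which is exactly the move that overfits when the feature is rare.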

Prior on θ
- Prior: θ has a normal (Gaussian) distribution with mean 0
- Gaussian distribution:
  - Continuous distribution on real numbers
  - Mean $\mu$, variance $\sigma^2$

$$N(x; \mu, \sigma) \propto \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

en.wikipedia.org/wiki/normal_distribution

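Modifying the inference algorithm
Adding the prior changes the log-likelihood:

$$P(D; \Theta, \sigma) = \prod_{(t_i, F_i) \in D} P(t_i \mid F_i; \Theta) \prod_j N(\theta_j; 0, \sigma)$$

$$\log P(D; \Theta, \sigma) = \sum_{(t_i, F_i) \in D} \log P(t_i \mid F_i; \Theta) + \sum_j \log N(\theta_j; 0, \sigma)$$

$$\log P(D; \Theta, \sigma) = \sum_{(t_i, F_i) \in D} \log P(t_i \mid F_i; \Theta) + \sum_j \log \exp\left(-\frac{(\theta_j - 0)^2}{2\sigma^2}\right) = \sum_{(t_i, F_i) \in D} \log P(t_i \mid F_i; \Theta) - \sum_j \frac{\theta_j^2}{2\sigma^2}$$

(The Gaussian's normalizing constant does not depend on θ and is dropped.)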

Which leads to a simple formula

$$\log P(D; \Theta, \sigma) = \sum_{(t_i, F_i) \in D} \log P(t_i \mid F_i; \Theta) - \lambda \sum_j \theta_j^2$$

- λ controls how much regularization you get (with the Gaussian prior, $\lambda = 1/(2\sigma^2)$)
- As usual, fit it on held-out data
- Usually controllable in packages
- Megam: -lambda <float>

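And we have to recompute the derivative

$$\frac{\partial}{\partial \theta_j} \left( -\sum_{j'} \frac{\theta_{j'}^2}{2\sigma^2} \right) = -\frac{1}{\sigma^2} \theta_j$$

Recalling that $\frac{d}{dx} x^2 = 2x$ (and the 2s cancel).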
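
The penalty term and its derivative are short enough to write out and check numerically. This sketch covers only the prior part of the objective, $-\sum_j \theta_j^2 / (2\sigma^2)$; the function names are hypothetical, and the finite-difference check is a common sanity test rather than anything from the lecture.

```python
def l2_log_prior(theta, sigma):
    """Log of the Gaussian prior on the weights, up to an additive constant."""
    return -sum(w * w for w in theta) / (2.0 * sigma ** 2)

def l2_log_prior_grad(theta, sigma):
    """Derivative of the log prior with respect to each weight: -theta_j / sigma^2."""
    return [-w / sigma ** 2 for w in theta]

def finite_difference(func, theta, j, eps=1e-6):
    """Numerical estimate of d func / d theta_j using a central difference."""
    up = list(theta)
    down = list(theta)
    up[j] += eps
    down[j] -= eps
    return (func(up) - func(down)) / (2.0 * eps)

theta = [0.5, -2.0, 3.0]
sigma = 1.5
analytic = l2_log_prior_grad(theta, sigma)
numeric = [finite_difference(lambda t: l2_log_prior(t, sigma), theta, j) for j in range(len(theta))]
```

In a full model these two functions would be added to the data log-likelihood and its gradient before both are handed to the optimizer.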

What this means
- Modified log-likelihood has a penalty term: decreased by the square of each θ
- So large-magnitude θ get punished
- Rare features: not worth it to increase θ, so less overfitting
- More general features get smoothed less: the prior penalty is the same, but the likelihood increase is better

Fancier smoothing schemes
- Normal distribution: penalty for small weights is small (squared penalty)
- Stricter penalties: $L_1$, an absolute value penalty
  - Penalty stays significant for small weights
  - Forces weights all the way to 0
  - Derivative doesn't exist at 0 (the prior mean µ); can use the OWL-QN optimizer
- (How do you pick? $L_2$ when features are relevant but noisy, $L_1$ when most features aren't useful at all)

Even fancier smoothing schemes
For instance, the so-called $L_{1,\infty}$ regularizer:

$$\log P(D; \Theta, \sigma) = \sum_{(t_i, F_i) \in D} \log P(t_i \mid F_i; \Theta) - \lambda \max_{j \in POS} |\theta_j| - \lambda \max_{j \in LEX} |\theta_j| - \lambda \max_{j \in SUFF} |\theta_j|$$

- Group features together, penalize the largest weight in each group
- Try to select useful groups of features
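
The group penalty itself is easy to compute: take the largest absolute weight in each group and sum those maxima. A sketch under the assumption that weights are already sorted into named groups; the function name and the example weights are made up.

```python
def l1_inf_penalty(weights_by_group, lam):
    """Sum over groups of the largest absolute weight in the group, scaled by lam.

    Empty groups contribute nothing.
    """
    return lam * sum(
        max(abs(w) for w in weights)
        for weights in weights_by_group.values()
        if weights
    )

# Hypothetical weights for the three groups named on the slide.
weights_by_group = {
    "POS": [0.5, -2.0],
    "LEX": [0.1],
    "SUFF": [0.0, 0.0],
}
penalty = l1_inf_penalty(weights_by_group, lam=0.5)
```

A group whose weights are all zero costs nothing, and once one weight in a group is nonzero the others can grow up to its size for free, which is why the penalty tends to switch whole groups on or off.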

Should you do this?
- Always use some regularizer, even if you're using another selection scheme
- Simple regularization isn't usually enough
- Fancy regularization isn't always available for standard packages, or it may be much slower

External selection
- Check if features on their own predict the tag
- Tons and tons of methods; not an area I know in detail
- Representative method: Chi-squared
  - Use the Chi-squared hypothesis test to evaluate the null hypothesis: $f_j$ is independent of $t$
  - If the null can be rejected at some p-value, keep $f_j$
  - The test doesn't say how much information is there... just that there might be some
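
For one binary feature and a two-way tag split, the test reduces to a 2x2 table of counts. The sketch below computes the Pearson chi-squared statistic by hand and, for this one-degree-of-freedom case only, gets the p-value from the normal tail via `math.erfc`. The function name and the counts are hypothetical, and every row and column total is assumed to be nonzero.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table of counts.

    table[r][c] is the count for feature value r and tag value c.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][c] + table[1][c] for c in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n  # count expected under independence
            stat += (table[r][c] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows are title-case / not, columns are NNP / other.
stat, p_value = chi_squared_2x2([[90, 41], [6, 860]])
keep_feature = p_value < 0.05
```

As the slide says, a tiny p-value only licenses rejecting independence; it does not rank features by how useful they are.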

Representative method: mutual information
- Information theory relates probability to encodings
  - Theory of compression
  - Unpredictable information takes more bits to compress
- Mutual information:

$$I(F, T) = \sum_{\text{values } f \text{ of } F} \; \sum_{\text{tags } t} P(F = f, T = t) \log \frac{P(F = f, T = t)}{P(F = f) P(T = t)}$$

- If F and T are independent, $P(F, T) = P(F)P(T)$, so I is 0
- Otherwise $I > 0$
- Interpretation: expected number of bits we save by encoding F, T together rather than separately
- MI does measure strength of association

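Example
- Tags: NNP, other
- Feature: title case, not title case

            title    no title    P(T)
    NNP     .09      .006
    other   .041     .86
    P(F)

$$I(F, T) = .09 \log \frac{P(\text{title}, \text{NNP})}{P(\text{title}) P(\text{NNP})} + .006 \log \frac{P(\text{notitle}, \text{NNP})}{P(\text{notitle}) P(\text{NNP})} + .041 \log \frac{P(\text{title}, \text{other})}{P(\text{title}) P(\text{other})} + .86 \log \frac{P(\text{notitle}, \text{other})}{P(\text{notitle}) P(\text{other})} = .295 \text{ bits}$$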
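
The same calculation can be scripted. This sketch takes the four joint probabilities that appear in the example, renormalizes them (as printed they sum to .997), derives the marginals by summing, and applies the MI formula with base-2 logs. The function name is hypothetical. The result comes out at roughly 0.29 bits, near the slide's .295; the small gap presumably reflects rounding in the printed values.

```python
import math

def mutual_information(joint):
    """Mutual information in bits from a dict {(f, t): P(F = f, T = t)}."""
    total = sum(joint.values())
    joint = {key: p / total for key, p in joint.items()}  # renormalize to sum to 1
    p_f, p_t = {}, {}
    for (f, t), p in joint.items():
        p_f[f] = p_f.get(f, 0.0) + p  # marginal over tags
        p_t[t] = p_t.get(t, 0.0) + p  # marginal over feature values
    mi = 0.0
    for (f, t), p in joint.items():
        if p > 0.0:  # cells with zero probability contribute nothing
            mi += p * math.log2(p / (p_f[f] * p_t[t]))
    return mi

slide_joint = {
    ("title", "NNP"): 0.09,
    ("notitle", "NNP"): 0.006,
    ("title", "other"): 0.041,
    ("notitle", "other"): 0.86,
}
mi_bits = mutual_information(slide_joint)
```

Feeding in a table where every cell equals the product of its marginals returns 0, matching the independence case.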

Strengths and weaknesses
- External selection is fast...
- Not distracted by relationships between features
- Easy to interpret
- Modular: can swap classifiers easily
- External selection doesn't care about your classifier
  - Beware of using a linear method and a non-linear classifier
- If features are highly correlated, you may get many copies of information you already have
- Rare features, even important ones, are often ignored

Step-wise selection
- Stepwise forward:
  - Learn all possible one-feature classifiers (search over $f_?$)
  - Keep the best feature $f_1$
  - Learn all two-feature classifiers $f_1, f_?$
  - If any is better, keep the best $f_1, f_2$
  - ...etc
- Stepwise backward (ablation):
  - Learn a classifier using $f_1, f_2, \ldots, f_n$
  - Learn all one-fewer classifiers: $f_1, f_2, \ldots, f_n$ with a single $f_j$ left out
  - If any one-fewer classifier is better than the original, throw out the worst $f_j$
  - Learn all two-fewer classifiers without $f_j, f_?$
  - ...etc
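
The forward procedure can be sketched as a greedy loop around any scoring function. Here `score` stands for "train a classifier on these features and return its held-out accuracy"; the `toy_score` below is an invented stand-in so the example runs on its own, and the loop starts from the empty feature set rather than unconditionally keeping the best single feature.

```python
def forward_select(candidates, score):
    """Greedy stepwise forward selection.

    Repeatedly adds the single feature that most improves score(selected),
    and stops when no addition improves it.
    """
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate helps, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

def toy_score(features):
    """Invented score: two useful signals, one redundant copy, small cost per feature."""
    chosen = set(features)
    total = 0.0
    if chosen & {"title-case", "title-case-copy"}:
        total += 0.5
    if "has-punc" in chosen:
        total += 0.2
    return total - 0.01 * len(chosen)

selected, final_score = forward_select(["title-case", "title-case-copy", "has-punc"], toy_score)
```

Each pass calls `score` once per remaining feature, which is the "lots of relearning the classifier" cost when `score` really trains a model.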

Comments on step-wise procedures
- Stepwise algorithms can be very slow: lots of relearning the classifier
- However, they can work better than external methods
  - They control for correlations
  - Once you have $f_1$ you don't add any copies of it
- Forward is biased toward the most general features, backward towards the most precise
- Neither is theoretically guaranteed to find the best set
- This is another optimization problem
  - And there is research on it!
  - But not mature enough to use out of the box

Metafeatures
- Perhaps your features represent a few underlying phenomena:
  - word-the, POS-tag-DT are related
  - So are preceding-word-the, POS-tag-NN
  - They encode an underlying "definiteness" feature
  - This is why they are correlated
- In math terms, you have a k-dimensional feature space
  - But there might only be d << k real dimensions
- Dimensionality reduction:
  - Summarize (or "embed") a high-dimensional space in a few dimensions
  - Or perhaps discrete clusters
- A way of dealing with the curse of dimensionality

Principal components analysis (PCA)
- Suppose we have a sample of sharks which vary along two dimensions: length and weight
- Long sharks tend to be heavy... short sharks are usually light

Really only one dimension

Center and rotate the data
- New X axis corresponds to "size": long, heavy sharks on the right; short, light ones on the left
- New Y axis is residual deviation from this relationship: sharks that are lighter or heavier than normal for their length

Using PCA
- Make your data into a big matrix
- Center and rotate it
- Then throw out dimensions along which there isn't much variance (the residual dimensions)
- In this example, we'd keep new X and throw out new Y
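
For two dimensions, "center and rotate" can be done without a linear algebra library: the rotation angle of the main axis follows from the 2x2 covariance matrix. The sketch below uses that closed form; the shark measurements are invented, and the function name is hypothetical. Real data with many features would go through an eigen-decomposition instead.

```python
import math

def pca_2d(points):
    """Center 2-D points and rotate them so the new X axis has maximum variance.

    Returns the rotated points and the rotation angle in radians.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n
    # Direction of the principal axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    c, s = math.cos(angle), math.sin(angle)
    # Project each centered point onto the principal axis and its perpendicular.
    rotated = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    return rotated, angle

# Invented (length, weight) pairs where weight grows roughly with length.
sharks = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
rotated, angle = pca_2d(sharks)
new_x_var = sum(x * x for x, _ in rotated) / len(rotated)
new_y_var = sum(y * y for _, y in rotated) / len(rotated)
size_only = [x for x, _ in rotated]  # keep new X, throw out new Y
```

Almost all of the variance lands on the new X axis, so dropping new Y loses little.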

Let's see it again
Image: Kendrick Kay, http://randomanalyses.blogspot. ... /01/principal-components-analysis.html, which is worth reading in full

About PCA
- Implemented as eigen-decomposition
- Issues with PCA:
  - Scales badly to mega-matrices
  - Output matrix is dense even if input is sparse
  - Metafeatures are difficult to interpret: they incorporate information from lots of features
  - Linear correlations only...
  - PCA doesn't care about labels, only features, so it might destroy information your classifier will want
- More advanced methods fix some of these, but at the expense of time and accessibility
