Building Maximum Entropy Text Classifier Using Semi-supervised Learning


1 Building Maximum Entropy Text Classifier Using Semi-supervised Learning. Zhang, Xinhua. For PhD Qualifying Exam Term Paper. School of Computing, NUS

2 Road map. Introduction: background and application; Semi-supervised learning, especially for text classification (survey); Maximum Entropy Models (survey); Combining semi-supervised learning and maximum entropy models (new); Summary

3 Road map. Introduction: background and application; Semi-supervised learning, esp. for text classification (survey); Maximum Entropy Models (survey); Combining semi-supervised learning and maximum entropy models (new); Summary

4 Introduction: Application of text classification. Text classification is useful and widely applied: cataloging news articles (Lewis & Gale, 1994; Joachims, 1998b); classifying web pages into a symbolic ontology (Craven et al., 2000); finding a person's homepage (Shavlik & Eliassi-Rad, 1998); automatically learning the reading interests of users (Lang, 1995; Pazzani et al., 1996); automatically threading and filtering email by content (Lewis & Knowles, 1997; Sahami et al., 1998); book recommendation (Mooney & Roy, 2000).

5 Early ways of text classification. In the early days: manual construction of rule sets (e.g., if "advertisement" appears, then filter). Hand-coding text classifiers in a rule-based style is impractical. Also, inducing and formulating the rules from examples is time- and labor-consuming.

6 Supervised learning for text classification. Supervised learning requires a large or prohibitive number of labeled examples, which is time- and labor-consuming. E.g., (Lang, 1995): after a person read and hand-labeled about 1000 articles, the learned classifier achieved an accuracy of about 50% when making predictions for only the top 10% of documents about which it was most confident.

7 What about using unlabeled data? Unlabeled data are abundant and easily available, and may be useful for improving classification. Published works show that they help. Why do unlabeled data help? Co-occurrence might explain something. Searching on Google, "sugar and sauce" returns 1,390,000 results while "sugar and math" returns 191,000 results, even though "math" is a more popular word than "sauce".

8 Using co-occurrence, and pitfalls. Simple idea: when A often co-occurs with B (a fact that can be found by using unlabeled data) and we know that articles containing A are often interesting, then articles containing B are probably also interesting. Problem: most current models using unlabeled data are based on problem-specific assumptions, which causes instability across tasks.

9 Road map. Introduction: background and application; Semi-supervised learning, especially for text classification (survey); Maximum Entropy Models (survey); Combining semi-supervised learning and maximum entropy models (new); Summary

10 Generative and discriminative semi-supervised learning models. Generative semi-supervised learning (Nigam, 2001): the Expectation-Maximization algorithm, which can fill in the missing labels using maximum likelihood. Discriminative semi-supervised learning (Vapnik, 1998): the Transductive Support Vector Machine (TSVM), which finds the linear separator between the labeled examples of each class that maximizes the margin over both the labeled and the unlabeled examples.
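
To make the EM idea above concrete, here is a minimal sketch of semi-supervised training of a multinomial Naive Bayes text classifier in the spirit of (Nigam, 2001): the E-step fills in soft labels for the unlabeled documents, the M-step re-estimates the parameters by maximum likelihood. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def semi_supervised_nb(X_l, y_l, X_u, n_classes, n_iter=20, alpha=1e-2):
    """EM for a multinomial Naive Bayes classifier on labeled + unlabeled documents.

    X_l, X_u: (n_docs, n_words) term-count matrices; y_l: integer labels for X_l.
    Returns class log-priors and per-class word log-probabilities.
    """
    X = np.vstack([X_l, X_u])
    # Responsibilities: fixed one-hot for labeled docs, re-estimated for unlabeled docs.
    R = np.zeros((X.shape[0], n_classes))
    R[np.arange(len(y_l)), y_l] = 1.0
    R[len(y_l):] = 1.0 / n_classes                       # initial guess for unlabeled docs

    for _ in range(n_iter):
        # M-step: maximum likelihood with the current (soft) label assignments.
        class_count = R.sum(axis=0) + alpha
        log_prior = np.log(class_count / class_count.sum())
        word_count = R.T @ X + alpha                      # (n_classes, n_words)
        log_pw = np.log(word_count / word_count.sum(axis=1, keepdims=True))

        # E-step: fill in the missing labels of the unlabeled documents.
        log_post = X_u @ log_pw.T + log_prior             # unnormalized log p(y | x)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        R[len(y_l):] = post / post.sum(axis=1, keepdims=True)

    return log_prior, log_pw
```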

11 Other semi-supervised learning models. Co-training (Blum & Mitchell, 1998); Active learning, e.g., (Schohn & Cohn, 2000); Reducing overfitting, e.g., (Schuurmans & Southey, 2000).

12 Theoretical value of unlabeled data. Unlabeled data help in some cases, but not all. For class probability parameter estimation, labeled examples are exponentially more valuable than unlabeled examples, assuming the underlying component distributions are known and correct (Castelli & Cover, 1996). Unlabeled data can degrade the performance of a classifier when there are incorrect model assumptions (Cozman & Cohen, 2002). The value of unlabeled data for discriminative classifiers such as TSVMs and for active learning is questionable (Zhang & Oles, 2000).

13 Models based on the clustering assumption (1): Manifold. Example: a handwritten "0" as an ellipse (5-dimensional). Classification functions are naturally defined only on the submanifold in question rather than on the total ambient space. Classification will be improved if we convert the representation onto the submanifold. Same idea as PCA, showing the use of unsupervised learning in semi-supervised learning. Unlabeled data help to construct the submanifold.

14 Manifold: unlabeled data help. [Figure: points A and B on the manifold, with and without unlabeled data; Belkin & Niyogi, 2002]

15 Models based on the clustering assumption (2): Kernel methods. Objective: make the induced distance small for points in the same class and large for those in different classes. Examples: Generative: for a mixture of Gaussians $(\mu_q, \Sigma_q)$, one kernel can be defined as $K(x, y) = \sum_q P(q \mid x)\, P(q \mid y)\, x^T \Sigma_q^{-1} y$ (Tsuda et al., 2002). Discriminative: the RBF kernel matrix $K_{ij} = \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right)$. Kernel methods can unify the manifold approach.
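
As a concrete illustration of the discriminative case, the sketch below builds the RBF kernel matrix $K_{ij} = \exp(-\|x_i - x_j\|^2/\sigma^2)$ over labeled and unlabeled points; names are illustrative.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2) over all pairs of rows of X."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clip tiny negatives from round-off
    return np.exp(-sq_dists / sigma ** 2)
```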

16 Models based on the clustering assumption (3): Min-cut. Express the pair-wise relationship (similarity) between labeled/unlabeled data as a graph, and find a partitioning that minimizes the sum of similarity between differently labeled examples.

17 Min-cut family algorithms. Problem with min-cut: degenerate (unbalanced) cuts. Remedies: randomness; normalization, as in Spectral Graph Partitioning. Principle: averages over examples (e.g., average margin, pos/neg ratio) should have the same expected value in the labeled and the unlabeled data.

18 Road map. Introduction: background and application; Semi-supervised learning, esp. for text classification (survey); Maximum Entropy Models (survey); Combining semi-supervised learning and maximum entropy models (new); Summary

19 Overview: Maximum entropy models. Advantages of the maximum entropy model: it is based on features, and allows and supports feature induction and feature selection; it offers a generic framework for incorporating unlabeled data; it only makes weak assumptions; it gives flexibility in incorporating side information; it handles multi-class classification naturally. So the maximum entropy model is worth further study.

20 Features in MaxEnt. A feature indicates the strength of certain aspects of the event, e.g., $f_t(x, y) = 1$ if and only if the current word, which is part of document $x$, is "back" and the class $y$ is "verb"; otherwise $f_t(x, y) = 0$. Features contribute to the flexibility of MaxEnt.
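
A feature of this kind is just a binary predicate over $(x, y)$ pairs; a hypothetical sketch (the document representation is assumed, not taken from the paper):

```python
def f_back_verb(x, y):
    """Fires when the current word of document x is "back" and the class is "verb"."""
    return 1.0 if x["current_word"] == "back" and y == "verb" else 0.0

# A feature set is then simply a list of such functions, evaluated on (x, y) pairs.
features = [f_back_verb]
```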

21 Standard MaxEnt Formulation. Maximize the conditional entropy
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) $$
subject to
$$ \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) = \sum_i \tilde{p}(x_i)\, \tilde{p}(y \mid x_i)\, f_t(x_i, y) \quad \text{for all } t, \qquad \sum_y p(y \mid x_i) = 1 \quad \text{for all } i. $$
The dual problem is just the maximum likelihood problem.
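
The solution of this program has the exponential form $p(y \mid x) \propto \exp\left(\sum_t \lambda_t f_t(x, y)\right)$ (see the dual on slide 45). A minimal sketch of evaluating it, with illustrative names:

```python
import numpy as np

def maxent_conditional(x, classes, features, lam):
    """p(y | x) = exp(sum_t lam_t * f_t(x, y)) / Z(x) for the given feature functions."""
    scores = np.array([sum(l * f(x, y) for l, f in zip(lam, features)) for y in classes])
    scores -= scores.max()                 # stabilize the exponentials
    p = np.exp(scores)
    return p / p.sum()
```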

22 Smoothing techniques (1): Gaussian prior (MAP). Maximize
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) - \sum_t \frac{\delta_t^2}{2\sigma_t^2} $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) = \delta_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$.

23 Smoothing techniques (2): Laplacian prior (Inequality MaxEnt). Maximize
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t$ for all $t$; $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) - E_{\tilde{p}}[f_t] \le B_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$. Extra strength: feature selection.

24 MaxEnt parameter estimation. Convex optimization: gradient descent, (conjugate) gradient descent; Generalized Iterative Scaling (GIS); Improved Iterative Scaling (IIS); limited memory variable metric (LMVM); sequential update algorithms.
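
All of these methods rely on the same gradient of the conditional log-likelihood: empirical feature expectations minus model feature expectations. A sketch of plain gradient ascent under that assumption (labeled data only; reuses maxent_conditional from the previous sketch; names are illustrative):

```python
import numpy as np

def log_likelihood_gradient(data, classes, features, lam):
    """Gradient of the conditional log-likelihood: E_empirical[f_t] - E_model[f_t]."""
    grad = np.zeros(len(features))
    for x, y_true in data:
        p = maxent_conditional(x, classes, features, lam)   # defined in the earlier sketch
        for t, f in enumerate(features):
            grad[t] += f(x, y_true)                                          # empirical count
            grad[t] -= sum(p[k] * f(x, y) for k, y in enumerate(classes))    # model expectation
    return grad / len(data)

def gradient_ascent(data, classes, features, lam, eta=0.1, n_iter=100):
    """Simple fixed-step gradient ascent on the conditional log-likelihood."""
    for _ in range(n_iter):
        lam = lam + eta * log_likelihood_gradient(data, classes, features, lam)
    return lam
```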

25 Road map. Introduction: background and application; Semi-supervised learning, esp. for text classification (survey); Maximum Entropy Models (survey); Combining semi-supervised learning and maximum entropy models (new); Summary

26 Semi-supervised MaxEnt. Why do we choose MaxEnt? 1st reason: simple extension to semi-supervised learning. Maximize
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) = 0$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$,
where $E_{\tilde{p}}[f_t] = \sum_i \tilde{p}(x_i)\, \tilde{p}(y \mid x_i)\, f_t(x_i, y)$. 2nd reason: weak assumptions.

27 Estimation error bounds. 3rd reason: estimation error bounds in theory. Maximize
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t$ for all $t$; $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) - E_{\tilde{p}}[f_t] \le B_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$.

28 Side Information. Assumptions about the accuracy of the empirical evaluation of the sufficient statistics alone are not enough. [Figure: two example configurations (1, 2) of labeled points] Use distance/similarity information.

29 Sources of side information. Instance similarity: neighboring relationships between different instances; redundant descriptions tracking the same object. Class similarity, using information on related classification tasks: combining different datasets (with different distributions) that are for the same classification task; hierarchical classes; structured class relationships (such as trees or other generic graphical models).

30 Incorporating similarity information: flexibility of the MaxEnt framework. Add the assumption that the class probabilities of $x_i, x_j$ are similar if the distance between $x_i, x_j$ in some metric is small. Use the distance metric to build a minimum spanning tree and add the side information to MaxEnt. Maximize
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) \;-\; \sum_{(i,j)\in E} w_{(i,j)} \sum_y \varepsilon_{i,j,y}^2 $$
s.t. $E_{\tilde{p}}[f_t] = \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y)$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$; $p(y \mid x_i) - p(y \mid x_j) = \varepsilon_{i,j,y}$ for all $y$ and $(i,j) \in E$, where $\tilde{w}_{(i,j)}$ is the true distance between $(x_i, x_j)$ and $w_{(i,j)} = C / \tilde{w}_{(i,j)}$.

31 Connection with the Min-cut family. Spectral Graph Partitioning:
$$ \min_{y \in \{+1,-1\}^n} \frac{\mathrm{cut}(G_+, G_-)}{|\{i : y_i = +1\}| \cdot |\{i : y_i = -1\}|}, \qquad y_i = +1 \text{ if } x_i \text{ is positively labeled}, \quad y_i = -1 \text{ if } x_i \text{ is negatively labeled}. $$
Harmonic function (Zhu et al., 2003): minimize
$$ \frac{1}{2} \sum_{i,j} w_{ij}\, \big( P(y_i = 1) - P(y_j = 1) \big)^2. $$
Compare with the MaxEnt formulation: maximize
$$ -\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) \;-\; \sum_{(i,j)\in E} w_{(i,j)} \sum_y \varepsilon_{i,j,y}^2 $$
s.t. $E_{\tilde{p}}[f_t] = \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y)$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$; $p(y \mid x_i) - p(y \mid x_j) = \varepsilon_{i,j,y}$ for all $y$ and $(i,j) \in E$ ($\varepsilon_{i,j,y}$?).
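
The harmonic function above has a simple iterative solution: clamp the labeled points and repeatedly replace each unlabeled point's score by the weighted average of its neighbours. A sketch under these assumptions (illustrative names):

```python
import numpy as np

def harmonic_labels(W, y, labeled_mask, n_iter=200):
    """Minimize 1/2 * sum_ij w_ij (P(y_i=1) - P(y_j=1))^2 with labeled points clamped.

    W: (n, n) symmetric similarity matrix; y: 0/1 labels (used only where labeled_mask).
    Returns P(y_i = 1) for every point.
    """
    p = np.where(labeled_mask, y.astype(float), 0.5)        # start unlabeled points at 0.5
    deg = W.sum(axis=1)
    for _ in range(n_iter):
        p_new = (W @ p) / np.maximum(deg, 1e-12)            # weighted-average update
        p = np.where(labeled_mask, y.astype(float), p_new)  # clamp the labeled points
    return p
```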

32 Miscellaneous promising research openings (1). Feature selection: a greedy algorithm to incrementally add features to the random field by selecting the feature that maximally reduces the objective function. Feature induction: if "IBM" appears in the labeled data while "Apple" does not, then using "IBM or Apple" as a feature can help (though it is costly).

33 Miscellaneous promising research openings (2). Interval estimation: minimize
$$ \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) $$
s.t. $-B_t \le E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$.
How should we set the $A_t$ and $B_t$? There is a whole bunch of results in statistics: the weak/strong law of large numbers, Hoeffding's inequality
$$ P\big( E_{\tilde{p}}[f_t] - E_p[f_t] > \beta \big) \le \exp(-2\beta^2 m), $$
or more advanced concepts from statistical learning theory, e.g., the VC-dimension of the feature class.
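
One hypothetical way to turn the Hoeffding bound into concrete box widths, assuming features in $[0, 1]$, $m$ labeled examples, and a union bound over the $T$ features:

```latex
% Sketch: two-sided Hoeffding gives P(|E_{\tilde p}[f_t] - E_p[f_t]| > \beta) <= 2 e^{-2 \beta^2 m}.
% Choosing equal box widths
\[
  A_t = B_t = \sqrt{\tfrac{1}{2m}\,\ln\tfrac{2T}{\eta}}
\]
% makes all T constraints hold simultaneously with probability at least 1 - \eta:
\[
  P\Big(\exists\, t:\ \big|E_{\tilde p}[f_t] - E_p[f_t]\big| > A_t\Big) \;\le\; \eta .
\]
```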

34 Miscellaneous promising research openings (3): Re-weighting. In view of the fact that the empirical estimation of the statistics is inaccurate, we add more weight to the labeled data, which may be more reliable than the unlabeled data. Minimize
$$ \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) + \sum_t \frac{\delta_t^2}{2\sigma_t^2} $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) = \delta_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$.

35 Re-weighting. Originally, $n_1$ labeled examples and $n_2$ unlabeled examples:
$$ x_1^l, x_2^l, \ldots, x_{n_1}^l,\; x_1^u, x_2^u, \ldots, x_{n_2}^u \qquad (n_1 \text{ labeled data},\ n_2 \text{ unlabeled data}). $$
Make $\beta$ copies of each labeled example:
$$ \underbrace{x_1^l, \ldots, x_{n_1}^l, \;\ldots,\; x_1^l, \ldots, x_{n_1}^l}_{\beta \text{ copies}},\; x_1^u, x_2^u, \ldots, x_{n_2}^u \qquad (\beta n_1 \text{ labeled data},\ n_2 \text{ unlabeled data}). $$
Then $\tilde{p}(x)$ for each labeled example is $\dfrac{\beta}{\beta n_1 + n_2}$ and for each unlabeled example is $\dfrac{1}{\beta n_1 + n_2}$. All equations before keep unchanged!
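
The $\beta$ copies never have to be materialized; they only change the empirical weights $\tilde{p}(x)$. A small sketch with illustrative names:

```python
def reweighted_px(n_labeled, n_unlabeled, beta):
    """Empirical p(x) after giving each labeled example beta copies.

    Returns (weight per labeled example, weight per unlabeled example);
    the weights sum to 1 over all n_labeled + n_unlabeled examples.
    """
    total = beta * n_labeled + n_unlabeled
    return beta / total, 1.0 / total

# e.g. 100 labeled, 900 unlabeled, beta = 3:
# each labeled example gets weight 3/1200, each unlabeled example 1/1200.
```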

36 Initial experimental results. Dataset: optical digits from UCI, 64 input attributes ranging in [0, 16], 10 classes. Algorithms tested: MST MaxEnt with re-weighting, Gaussian Prior MaxEnt, Inequality MaxEnt, TSVM (linear and polynomial kernels, one-against-all). Testing strategy: report the results for the parameter setting with the best performance on the test set.

37 Initial experiment results. [Figure: results plot]

38 Summary. The Maximum Entropy model is promising for semi-supervised learning. Side information is important and can be flexibly incorporated into the MaxEnt model. Future research can be done in the areas pointed out (feature selection/induction, interval estimation, side information formulation, re-weighting, etc.).

39 Question and Answer Session. Questions are welcome.

40 GIS. Iterative update rule for the unconditional probability:
$$ p^{(s+1)}(x) = p^{(s)}(x) \prod_t \left( \frac{\sum_j \tilde{p}(x_j)\, f_t(x_j)}{\sum_j p^{(s)}(x_j)\, f_t(x_j)} \right)^{f_t(x)}, \qquad \lambda_t^{(s+1)} = \lambda_t^{(s)} + \log \frac{E_{\tilde{p}}[f_t]}{E_{p^{(s)}}[f_t]}. $$
GIS for the conditional probability:
$$ \lambda_t^{(s+1)} = \lambda_t^{(s)} + \eta \log \frac{E_{\tilde{p}}[f_t]}{\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i, \lambda^{(s)})\, f_t(x_i, y)} = \lambda_t^{(s)} + \eta \log \frac{E_{\tilde{p}}[f_t]}{E_{p^{(s)}}[f_t]}. $$
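
A minimal sketch of the conditional GIS update, assuming non-negative features whose sum over $t$ is a constant $C$ for every $(x, y)$ (see slide 42); it reuses maxent_conditional from the earlier sketch, and the names are illustrative:

```python
import numpy as np

def gis_step(data, classes, features, lam, C, eta=1.0):
    """One GIS update: lam_t += (eta / C) * log(E_emp[f_t] / E_model[f_t])."""
    T = len(features)
    emp, model = np.zeros(T), np.zeros(T)
    for x, y_true in data:
        p = maxent_conditional(x, classes, features, lam)   # from the earlier sketch
        for t, f in enumerate(features):
            emp[t] += f(x, y_true)
            model[t] += sum(p[k] * f(x, y) for k, y in enumerate(classes))
    ratio = (emp + 1e-12) / (model + 1e-12)                 # guard against zero expectations
    return lam + (eta / C) * np.log(ratio)
```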

41 IIS. Characteristics: monotonic decrease of the MaxEnt objective function; each update depends only on the computation of expected values $E_{p^{(s)}}[\cdot]$, not requiring the gradient or higher derivatives. Update rule for the unconditional probability: $\Delta\lambda_t$ is the solution to
$$ E_{\tilde{p}}[f_t] = \sum_x p^{(s)}(x)\, f_t(x)\, \exp\left( \Delta\lambda_t \sum_j f_j(x) \right) \quad \text{for all } t. $$
The $\Delta\lambda_t$ are decoupled and solved individually. Monte Carlo methods are to be used if the number of possible $x$ is too large.

42 GIS. Characteristics: converges to the unique optimal value of $\lambda$; parallel update, i.e., the $\lambda_t^{(s)}$ are updated synchronously; slow convergence. Prerequisite of the original GIS: for all training examples $x_i$, $f_t(x_i) \ge 0$ and $\sum_t f_t(x_i) = C$. Relaxing the prerequisite: define $f_t' = f_t / C$ so that $\sum_t f_t'(x_i) = 1$. If not all training data have summed features equaling $C$, then set $C$ sufficiently large and incorporate a correction feature.

43 Other standard optimization algorithms. Gradient descent:
$$ \lambda_t^{(s+1)} = \lambda_t^{(s)} + \eta \left. \frac{\partial L}{\partial \lambda_t} \right|_{\lambda = \lambda^{(s)}}. $$
Conjugate gradient methods, such as the Fletcher-Reeves and Polak-Ribière-Positive algorithms. Limited memory variable metric, quasi-Newton methods: approximate the Hessian using successive evaluations of the gradient.
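
In practice the dual (the negative conditional log-likelihood) and its gradient can simply be handed to an off-the-shelf limited-memory quasi-Newton solver; a sketch using scipy's L-BFGS-B (assumed available), reusing the earlier gradient sketch and illustrative names:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(lam, data, classes, features):
    """Average negative conditional log-likelihood of the MaxEnt model."""
    nll = 0.0
    for x, y_true in data:
        p = maxent_conditional(x, classes, features, lam)   # from the earlier sketch
        nll -= np.log(p[classes.index(y_true)] + 1e-300)
    return nll / len(data)

def fit_maxent_lbfgs(data, classes, features):
    """Fit lambda by minimizing the dual with a limited-memory quasi-Newton method."""
    lam0 = np.zeros(len(features))
    res = minimize(
        fun=lambda lam: neg_log_likelihood(lam, data, classes, features),
        x0=lam0,
        jac=lambda lam: -log_likelihood_gradient(data, classes, features, lam),
        method="L-BFGS-B",
    )
    return res.x
```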

44 Sequential updating algorithms. For a very large (or infinite) number of features, parallel algorithms will be too resource-consuming to be feasible. Sequential update: a style of coordinate-wise descent that modifies one parameter at a time. It converges to the same optimum as the parallel update.

45 Dual Problem of the Standard MaxEnt. Primal: minimize $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i)$ s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) = 0$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$. Dual problem:
$$ \min_\lambda\; L(p, \lambda) = -\sum_t \lambda_t E_{\tilde{p}}[f_t] + \sum_i \tilde{p}(x_i) \log Z_i, \qquad Z_i = \sum_y \exp\left( \sum_t \lambda_t f_t(x_i, y) \right). $$

46 Relationship with maximum likelihood. Suppose
$$ p(y \mid x) = \frac{1}{Z} \exp\left( \sum_t \lambda_t f_t(x, y) \right), \qquad Z = \sum_y \exp\left( \sum_t \lambda_t f_t(x, y) \right). $$
Then the log-likelihood is
$$ L(\lambda) = \sum_{x,y} \tilde{p}(x, y) \log p(x, y) = \sum_x \tilde{p}(x) \log \tilde{p}(x) + \sum_t \lambda_t E_{\tilde{p}}[f_t] - \sum_x \tilde{p}(x) \log Z, $$
while the dual of MaxEnt is
$$ \min_\lambda\; L(p, \lambda) = -\sum_t \lambda_t E_{\tilde{p}}[f_t] + \sum_x \tilde{p}(x) \log Z. $$
Maximizing the likelihood is equivalent to minimizing the dual objective.

47 Smoothing techniques (2): Exponential prior. Minimize $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i)$ s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$. Dual problem: minimize
$$ L(\lambda) = -\sum_t \lambda_t E_{\tilde{p}}[f_t] + \sum_i \tilde{p}(x_i) \log Z_i + \sum_t A_t \lambda_t, \qquad Z_i = \sum_y \exp\left( \sum_t \lambda_t f_t(x_i, y) \right), $$
which is equivalent to maximizing $\prod_i p(y_i \mid x_i) \cdot \prod_t A_t \exp(-A_t \lambda_t)$.

48 Smoothing techniques (1): Gaussian prior (MAP). Minimize $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) + \sum_t \frac{\delta_t^2}{2\sigma_t^2}$ s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) = \delta_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$. Dual problem: minimize
$$ L(\lambda) = -\sum_t \lambda_t E_{\tilde{p}}[f_t] + \sum_i \tilde{p}(x_i) \log Z_i + \sum_t \frac{\lambda_t^2}{2\sigma_t^2}, \qquad Z_i = \sum_y \exp\left( \sum_t \lambda_t f_t(x_i, y) \right). $$

49 Smoothing techniques (3): Laplacian prior (Inequality MaxEnt). Minimize $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i)$ s.t. $-B_t \le E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$. Dual problem: minimize
$$ L(\alpha, \beta) = -\sum_t (\alpha_t - \beta_t) E_{\tilde{p}}[f_t] + \sum_i \tilde{p}(x_i) \log Z_i + \sum_t A_t \alpha_t + \sum_t B_t \beta_t, \qquad \alpha_t \ge 0,\ \beta_t \ge 0, $$
where $Z_i = \sum_y \exp\left( \sum_t (\alpha_t - \beta_t) f_t(x_i, y) \right)$.

50 Smoothing techniques (4): Inequality with 2-norm penalty. Minimize
$$ \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) + C_1 \sum_t \delta_t^2 + C_2 \sum_t \zeta_t^2 $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t + \delta_t$ for all $t$; $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) - E_{\tilde{p}}[f_t] \le B_t + \zeta_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$.

51 Smoothing techniques (5): Inequality with 1-norm penalty. Minimize
$$ \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i) \log p(y \mid x_i) + C_1 \sum_t \delta_t + C_2 \sum_t \zeta_t $$
s.t. $E_{\tilde{p}}[f_t] - \sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) \le A_t + \delta_t$ for all $t$; $\sum_i \tilde{p}(x_i) \sum_y p(y \mid x_i)\, f_t(x_i, y) - E_{\tilde{p}}[f_t] \le B_t + \zeta_t$ for all $t$; $\sum_y p(y \mid x_i) = 1$ for all $i$; $\delta_t \ge 0,\ \zeta_t \ge 0$ for all $t$.

52 Using MaxEnt as smoothing. Add a maximum entropy term into the target function of other models, exploiting MaxEnt's preference for the uniform distribution.

53 Bounded error. Let the correct distribution be $p^C(x_i)$, with
$$ E_{p^C}[f_t] = \sum_i p^C(x_i) \sum_y p^C(y \mid x_i)\, f_t(x_i, y), \qquad L_{p^C}(\lambda) = -\sum_t \lambda_t E_{p^C}[f_t] + \sum_i p^C(x_i) \log Z_i. $$
Let $\hat{\lambda}_{A,B} = \arg\min_\lambda L_{\tilde{p}}(\lambda)$ and $\lambda^* = \arg\min_\lambda L_{p^C}(\lambda)$. Conclusion:
$$ L_{p^C}(\hat{\lambda}) \le L_{p^C}(\lambda^*) + \sum_t |\lambda_t^*| (A_t + B_t). $$
