ICDM Pisa, December 16, Motivation


1 Using Wikipedia for Co-clustering Based Cross-domain Text Classification

Pu Wang and Carlotta Domeniconi, George Mason University
Jian Hu, Microsoft Research Asia
ICDM, Pisa, December 16, 2008

Motivation
Labeled data are seldom available, and often too expensive to obtain.
Abundant labeled data may exist for a different but related domain.
Goal: Use the labeled data from the related domain as auxiliary information to accomplish the task (classification) in the target domain.


2 Main Idea
Leverage the shared dictionary across the in-domain and out-of-domain (target) documents to propagate label information.
[Diagram: $D_i$ and $D_o$ linked by common words; label propagation from $D_i$ to $D_o$.]

Main Idea
Enrich document representation to fill the semantic gap.
[Diagram: $D_i$ and $D_o$ linked by common words and semantic concepts; label propagation from $D_i$ to $D_o$.]

3 Co-clustering based Classification
$D_i$: in-domain documents
$D_o$: out-of-domain documents
$C$: set of class labels
$W$: dictionary of all the words

Co-clustering of $D_o$:
$C_{D_o}: \{d_1, \ldots, d_m\} \to \{\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_{|C|}\} = \hat{D}_o$
$C_W: \{w_1, \ldots, w_n\} \to \{\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k\} = \hat{W}$

4 Co-clustering based Classification
[Figure: two diagrams of the co-clustering procedure. "Initialization" panel: initialization of document clusters and initialization of word clusters, over in-domain documents, out-of-domain documents, and words, with a classifier box. "Clustering" panel: clustering of documents and clustering of words, over in-domain and out-of-domain documents and words.]

5 Co-clustering based Classification
Iterative algorithm that achieves

$$\min_{\hat{D}_o, \hat{W}} \left\{ I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda \left( I(C; W) - I(C; \hat{W}) \right) \right\}$$

$I(D_o; W) - I(\hat{D}_o; \hat{W})$: loss in mutual information between documents and words
$I(C; W) - I(C; \hat{W})$: loss in mutual information between class labels and words

Mutual information:

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Information Theoretic Co-clustering [Dhillon et al., KDD 03]
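To make the first term of the objective concrete, here is a minimal sketch that computes $I(D_o; W) - I(\hat{D}_o; \hat{W})$ for a toy joint distribution. The 4 x 4 matrix `p`, the cluster assignments, and the helper names `mutual_information` and `cluster_joint` are all illustrative assumptions, not taken from the slides.

```python
from math import log

def mutual_information(joint):
    """I(X;Y) in nats for a joint distribution given as a list of rows."""
    px = [sum(row) for row in joint]          # marginal over rows
    py = [sum(col) for col in zip(*joint)]    # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # 0 * log(0) is taken as 0
                mi += pxy * log(pxy / (px[i] * py[j]))
    return mi

def cluster_joint(joint, row_clusters, col_clusters, n_row, n_col):
    """Aggregate p(d, w) into the clustered joint p(d_hat, w_hat)."""
    out = [[0.0] * n_col for _ in range(n_row)]
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            out[row_clusters[i]][col_clusters[j]] += pxy
    return out

# Hypothetical joint p(d, w): 4 documents x 4 words, entries sum to 1.
p = [
    [0.20, 0.05, 0.00, 0.00],
    [0.05, 0.20, 0.00, 0.00],
    [0.00, 0.00, 0.20, 0.05],
    [0.00, 0.00, 0.05, 0.20],
]
doc_clusters = [0, 0, 1, 1]    # assumed document clustering
word_clusters = [0, 0, 1, 1]   # assumed word clustering

p_hat = cluster_joint(p, doc_clusters, word_clusters, 2, 2)
loss = mutual_information(p) - mutual_information(p_hat)
print(f"loss in mutual information: {loss:.4f}")
```

Clustering can only discard information, so the loss is non-negative; a co-clustering that keeps it small preserves most of the document-word association.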

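6
$$f(w) = \sum_{d \in D_o} p(d, w), \qquad f(d \mid w) = p(d \mid w) = \frac{f(d, w)}{f(w)}$$

$$f(d) = \sum_{w \in W} p(d, w), \qquad f(w \mid d) = p(w \mid d) = \frac{p(d, w)}{f(d)}$$

$$\hat{f}(\hat{w} \mid \hat{d}) = p(\hat{w} \mid \hat{d}), \qquad \hat{f}(\hat{d} \mid \hat{w}) = p(\hat{d} \mid \hat{w})$$

$$\hat{f}(d \mid \hat{d}) = p(d \mid \hat{d}), \qquad \hat{f}(w \mid \hat{w}) = p(w \mid \hat{w})$$

$$\hat{f}(d \mid \hat{w}) = \hat{f}(d \mid \hat{d})\, \hat{f}(\hat{d} \mid \hat{w}) = p(d \mid \hat{d})\, p(\hat{d} \mid \hat{w})$$

$$\hat{f}(w \mid \hat{d}) = \hat{f}(w \mid \hat{w})\, \hat{f}(\hat{w} \mid \hat{d}) = p(w \mid \hat{w})\, p(\hat{w} \mid \hat{d})$$

$$\hat{g}(c, w) = p(c, \hat{w})\, p(w \mid \hat{w}) = p(c, \hat{w})\, \frac{p(w)}{p(\hat{w})}$$

$$g(w) = \sum_{c \in C} p(c, w), \qquad g(c \mid w) = p(c \mid w) = \frac{g(c, w)}{g(w)}$$

$$\hat{g}(c \mid \hat{w}) = \frac{\sum_{w \in \hat{w}} p(c \mid w)\, p(w)}{p(\hat{w})} = \frac{\sum_{w \in \hat{w}} p(c \mid w)\, p(w)}{\sum_{w \in \hat{w}} p(w)}$$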

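7 Co-clustering based Classification

$$I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda \left( I(C; W) - I(C; \hat{W}) \right) = D\big(f(D_o, W) \,\|\, \hat{f}(D_o, W)\big) + \lambda\, D\big(g(C, W) \,\|\, \hat{g}(C, W)\big)$$

where $D(\cdot \,\|\, \cdot)$ is the KL divergence:

$$D\big(p(x) \,\|\, q(x)\big) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Co-clustering based Classification

$$D\big(f(D_o, W) \,\|\, \hat{f}(D_o, W)\big) = \sum_{\hat{d} \in \hat{D}_o} \sum_{d \in \hat{d}} f(d)\, D\big(f(W \mid d) \,\|\, \hat{f}(W \mid \hat{d})\big)$$

$$D\big(f(D_o, W) \,\|\, \hat{f}(D_o, W)\big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} f(w)\, D\big(f(D_o \mid w) \,\|\, \hat{f}(D_o \mid \hat{w})\big)$$

$$D\big(g(C, W) \,\|\, \hat{g}(C, W)\big) = \sum_{\hat{w} \in \hat{W}} \sum_{w \in \hat{w}} g(w)\, D\big(g(C \mid w) \,\|\, \hat{g}(C \mid \hat{w})\big)$$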

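8 Co-clustering based Classification
Document cluster update:

$$C^{(t)}_{D_o}(d) = \arg\min_{\hat{d}} D\big(f(W \mid d) \,\|\, \hat{f}^{(t-1)}(W \mid \hat{d})\big)$$

Word cluster update:

$$C^{(t+1)}_{W}(w) = \arg\min_{\hat{w}} \Big[ f(w)\, D\big(f(D_o \mid w) \,\|\, \hat{f}(D_o \mid \hat{w})\big) + \lambda\, g(w)\, D\big(g(C \mid w) \,\|\, \hat{g}(C \mid \hat{w})\big) \Big]$$

Main Idea
Enrich document representation to fill the semantic gap.
[Diagram: $D_i$ and $D_o$ linked by common words and semantic concepts; label propagation from $D_i$ to $D_o$.]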
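The document cluster update can be sketched as follows: each document moves to the cluster $\hat{d}$ whose distribution $\hat{f}(W \mid \hat{d}) = p(w \mid \hat{w})\, p(\hat{w} \mid \hat{d})$ is closest in KL divergence to $f(W \mid d)$. This is a minimal illustration that assumes a dense joint matrix and no empty clusters; the toy data and function names are hypothetical, and the word update (which adds the $\lambda$ term) is not shown.

```python
from math import log

def kl(p, q):
    """D(p || q); infinite when q(x) = 0 but p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return float("inf")
            total += px * log(px / qx)
    return total

def f_hat_words_given_doc_cluster(joint, doc_cl, word_cl, n_dc, n_wc):
    """f_hat(w | d_hat) = p(w | w_hat) * p(w_hat | d_hat) per doc cluster."""
    n_words = len(joint[0])
    p_w = [sum(row[j] for row in joint) for j in range(n_words)]
    p_wc = [0.0] * n_wc                      # p(w_hat)
    for j, pw in enumerate(p_w):
        p_wc[word_cl[j]] += pw
    p_cc = [[0.0] * n_wc for _ in range(n_dc)]   # p(d_hat, w_hat)
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            p_cc[doc_cl[i]][word_cl[j]] += pxy
    result = []
    for dc in range(n_dc):
        p_dc = sum(p_cc[dc])                 # p(d_hat), assumed non-zero
        result.append([
            (p_w[j] / p_wc[word_cl[j]]) * (p_cc[dc][word_cl[j]] / p_dc)
            for j in range(n_words)
        ])
    return result

def reassign_documents(joint, doc_cl, word_cl, n_dc, n_wc):
    """One document-update step: argmin over d_hat of the KL divergence."""
    f_hat = f_hat_words_given_doc_cluster(joint, doc_cl, word_cl, n_dc, n_wc)
    new_clusters = []
    for row in joint:
        p_d = sum(row)
        f_w_given_d = [pxy / p_d for pxy in row]
        best = min(range(n_dc), key=lambda dc: kl(f_w_given_d, f_hat[dc]))
        new_clusters.append(best)
    return new_clusters

# Hypothetical joint p(d, w); document 1 starts in the wrong cluster.
p = [
    [0.20, 0.05, 0.00, 0.00],
    [0.05, 0.20, 0.00, 0.00],
    [0.00, 0.00, 0.20, 0.05],
    [0.00, 0.00, 0.05, 0.20],
]
updated = reassign_documents(p, [0, 1, 1, 1], [0, 0, 1, 1], 2, 2)
print(updated)
```

In the full algorithm this step would alternate with the word update until the objective stops decreasing, with the in-domain labels entering through the $g$ terms.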


9 Building Semantic Kernels from Wikipedia: Overall Approach
Steps: Build Thesaurus from Wikipedia; Build Semantic Kernels; Search Wikipedia Concepts in Documents; Enrich Document Representation with Wikipedia Concepts.

Thesaurus example:
Ambiguous concepts: Puma, Puma (Car)
Redirect concepts of "Puma": "Cougar", "Mountain Lion"
Category of "Puma": "Felidae" (concept "Puma" belongs to category "Felidae")
Concepts linked to "Puma (Car)": "Ford Vehicles" (category), "Automobile"

Text document: "... The Cougar, also Puma and Mountain lion, is a New World mammal of the Felidae family..."
Candidate concepts: Puma 2, ...
Disambiguation: "Puma" here means a kind of animal, not a car or a sports brand.
Enriched document representation: Puma 2, Cougar 2, Felines, ...

Wikipedia Concept Proximity Matrix (over terms and concepts):

$$\begin{pmatrix} 1 & a & \cdots & b \\ a & 1 & \cdots & c \\ \vdots & \vdots & \ddots & \vdots \\ b & c & \cdots & 1 \end{pmatrix}$$

$$S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$$

$S_{BOW}$: content-based; $S_{OLC}$: outlink category-based; $D_{cat}$: distance-based
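As a worked instance of the combined similarity $S$, the sketch below evaluates the formula for one concept pair. The input values, the function name `combined_similarity`, and the check that the weights are non-negative and sum to at most 1 are assumptions made for illustration; the slides do not state how $\lambda_1$ and $\lambda_2$ are chosen.

```python
def combined_similarity(s_bow, s_olc, d_cat, lam1, lam2):
    """S = lam1*S_BOW + lam2*S_OLC + (1 - lam1 - lam2)*(1 - D_cat)."""
    # Assumed constraint: a convex combination of the three measures.
    if lam1 < 0 or lam2 < 0 or lam1 + lam2 > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return lam1 * s_bow + lam2 * s_olc + (1 - lam1 - lam2) * (1 - d_cat)

# Hypothetical measurements for one pair of concepts.
s = combined_similarity(s_bow=0.6, s_olc=0.4, d_cat=0.2, lam1=0.4, lam2=0.5)
print(round(s, 4))
```

The category distance $D_{cat}$ enters as $1 - D_{cat}$ so that all three terms are similarities, with larger values meaning closer concepts.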

10 Proximity Matrix (over terms and concepts)

$$P_{ij} = \begin{cases} 1 & \text{if } c_i \text{ and } c_j \text{ are synonyms;} \\ \mu^{-\mathrm{depth}} & \text{if } c_i \text{ and } c_j \text{ are hyponyms;} \\ S & \text{if } c_i \text{ and } c_j \text{ are associative concepts;} \\ 0 & \text{otherwise.} \end{cases}$$

where $S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$ combines the content-based ($S_{BOW}$), outlink category-based ($S_{OLC}$), and distance-based ($D_{cat}$) measures.

Building Semantic Kernels
Example text: "Machine learning, statistical learning and data mining are related subjects."
Original BOW vector: <machine:1, statistical:1, learn:2, data:1, mine:1, relate:1, subject:1>
Find Wikipedia concepts and keep the remaining terms as they are:
$\phi(d)$ = <relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; ...>
Multiply by the proximity matrix $P$ (rows and columns: Machine Learning, Statistical Learning, Data Mining, Artificial Intelligence, ...):
$\tilde{\phi}(d) = \phi(d) P$
Enriched document vector representation = <relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; artificial intelligence:0.3252>
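The enrichment step $\tilde{\phi}(d) = \phi(d) P$ can be sketched with sparse dictionaries. The proximity values below are invented for illustration and are not meant to reproduce the 0.3252 weight on the slide; the function name `enrich` and the layout of `proximity` are likewise assumptions.

```python
def enrich(phi, proximity):
    """Compute phi(d) P for a sparse document vector.

    phi maps a term or concept to its weight; proximity[a][b] holds the
    off-diagonal entry P_ab. The diagonal of P is 1, so every original
    entry of phi is carried over unchanged.
    """
    enriched = {}
    for term, weight in phi.items():
        enriched[term] = enriched.get(term, 0.0) + weight          # diagonal
        for other, p_ab in proximity.get(term, {}).items():
            enriched[other] = enriched.get(other, 0.0) + weight * p_ab
    return enriched

# Document vector after concept detection (from the slide example).
phi = {
    "relate": 1, "subject": 1,
    "machine learning": 1, "statistical learning": 1, "data mining": 1,
}

# Hypothetical off-diagonal proximities to a concept absent from the text.
proximity = {
    "machine learning": {"artificial intelligence": 0.2},
    "data mining": {"artificial intelligence": 0.1},
}

enriched = enrich(phi, proximity)
for key, value in enriched.items():
    print(f"{key}: {value:.4f}")
```

A concept that never occurs in the document ("artificial intelligence" here) receives a non-zero weight through its proximity to concepts that do occur, which is how the enriched representation bridges vocabulary differences between domains.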

11 Empirical Evaluation
Data sets: 20Newsgroups and SRAA
Methods: CoCC with and without enrichment; NB with and without enrichment

Cross-domain Classification Precision Rates
[Table: columns are Data Set, NB and CoCC without enrichment, NB and CoCC with enrichment. Rows: rec vs talk; rec vs sci; comp vs talk; comp vs sci; comp vs rec; sci vs talk; rec vs sci vs comp; rec vs talk vs sci; sci vs talk vs comp; rec vs talk vs sci vs comp; real vs simulation; auto vs aviation. The precision values are missing.]


12 [Figure: CoCC with enrichment: Precision as a function of the number of iterations. x-axis: Iterations; y-axis: Precision.]
[Figure: CoCC with enrichment: Precision as a function of λ (sci vs talk vs comp). x-axis: λ; y-axis: Precision.]

13 [Figure: CoCC with enrichment: Precision as a function of the number of word clusters (sci vs talk vs comp). x-axis: number of word clusters; y-axis: Precision.]

Conclusions
Extended the co-clustering approach for cross-domain text classification by embedding background knowledge using Wikipedia.
Future work:
Explore alternative representations for the common language substrate
Cross-language text classification
