Interesting-Phrase Mining for Ad-Hoc Text Analytics

1 Interesting-Phrase Mining for Ad-Hoc Text Analytics Klaus Berberich (…inf.mpg.de) Joint work with: Srikanta Bedathur, Jens Dittrich, Nikos Mamoulis, and Gerhard Weikum (originally presented at VLDB 2010)

2 Motivation Continuously growing wealth of unstructured text data, e.g.: e-mail and instant messaging, social media content (e.g., twitter and facebook), customer reports and product reviews, the general Web (e.g., news portals). Search on this data is understood and under control (e.g., find all news articles that mention barack obama). Gaining insights is tedious and a task mostly left to users (e.g., identify characteristic quotations by barack obama) 2/27

3 Motivation Tools available today include tag clouds and search-result snippets. Limitations: focus on single terms (no entity names, quotations, etc.), frequency-based (not necessarily interesting, e.g., stopwords), global analysis (no ad-hoc document sets, e.g., determined by a query). Our idea: Identify interesting phrases that distinguish an ad-hoc document set from the document collection as a whole 3/27

6 Outline Motivation Problem Statement Our Approach Prior Art Experimental Evaluation Conclusion & Ongoing Research 4/27

7 Model & Problem Statement Document collection D. Documents are sequences of terms (e.g., d12 = < a x z b l k a q x >). Phrases are sequences of terms (e.g., < a x z >, < a x z b >, < z b l k >, ...). [Figure: documents d12, d37, d42 shown as sequences of terms] Input: Ad-hoc document set D' ⊆ D (e.g., all documents that contain barack obama). Output: k most interesting phrases in D' (e.g., 1. < z x z >, 2. < e s q x >, ..., k. < d l e s q >) 5/27

10 When Do We Consider a Phrase Interesting? We distinguish for a phrase p its local frequency freq(p, D') in the ad-hoc document set D' and its global frequency freq(p, D) in the document collection D. The interestingness of phrase p is defined as I(p, D') = freq(p, D') / freq(p, D). We consider only phrases as relevant that are globally frequent, i.e., occur in at least τ documents from D, and have length at most m (e.g., 5 in our experiments). Note: Our methods adapt to other definitions of interestingness (e.g., based on log-likelihood ratios or PMI) 6/27
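
The scoring rule above is simple enough to state as code. A minimal sketch with hypothetical helper names (in the real system the counts would come from precomputed indexes, not Python dictionaries):

```python
def interestingness(local_freq, global_freq):
    """I(p, D') = freq(p, D') / freq(p, D)."""
    return local_freq / global_freq

def top_k_phrases(local_freqs, global_freqs, k, tau=10, max_len=5):
    """Rank phrases by interestingness, keeping only relevant ones:
    globally frequent (in >= tau documents) and at most max_len terms long."""
    scored = [
        (interestingness(lf, global_freqs[p]), p)
        for p, lf in local_freqs.items()
        if global_freqs.get(p, 0) >= tau and len(p) <= max_len
    ]
    scored.sort(reverse=True)
    return [p for _, p in scored[:k]]
```

Phrases are modeled as tuples of terms. A phrase concentrated in D' scores close to 1, while a stopword spread evenly over the whole collection scores close to |D'| / |D|.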

11 Outline Motivation Problem Statement Our Approach Prior Art Experimental Evaluation Conclusion & Ongoing Research 7/27

12 Overview We precompute the global frequency freq(p, D) for every relevant phrase p using an apriori-style algorithm and keep a phrase dictionary that maps p → freq(p, D). Our methods rely on forward indexes that map each document to a representation of its content (d12 → representation of d12's content, d37 → representation of d37's content, d42 → representation of d42's content). For a given ad-hoc document set D', our methods access the forward index for each d ∈ D', merge the |D'| content representations, and output the k most interesting phrases 8/27
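
Assuming a phrase-level content representation, the query-time procedure just described can be sketched as follows (forward_index and phrase_dict are hypothetical in-memory stand-ins for the on-disk structures):

```python
from collections import Counter

def query(forward_index, phrase_dict, D_prime, k):
    """Merge the content representations of all documents in the ad-hoc
    set D' and rank phrases by interestingness freq(p, D') / freq(p, D)."""
    local = Counter()
    for d in D_prime:                    # one forward-index access per document
        local.update(forward_index[d])   # representation: contained phrases
    ranked = sorted(local.items(),
                    key=lambda item: item[1] / phrase_dict[item[0]],
                    reverse=True)
    return [p for p, _ in ranked[:k]]
```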

13 Document Content Idea: Keep document content explicitly as a sequence of contained terms (or term identifiers): d12 → < a x z b l k a q x >, d37 → < z x z d l e s q x >, d42 → < k x z d a k q a y >. Benefit: Space-efficient. Drawbacks: Enumeration of phrases needed (including globally-infrequent ones); requires phrase dictionary with global frequencies freq(p, D) 9/27
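
With this representation, all phrases up to length m must be enumerated at query time. A sketch of that enumeration (an assumed helper, not the paper's implementation):

```python
def enumerate_phrases(terms, max_len=5):
    """Enumerate every phrase (contiguous subsequence) of at most
    max_len terms, including globally-infrequent ones."""
    phrases = set()
    for i in range(len(terms)):
        for j in range(i + 1, min(i + max_len, len(terms)) + 1):
            phrases.add(tuple(terms[i:j]))
    return phrases
```

Each enumerated phrase must then be looked up in the phrase dictionary to discard globally-infrequent ones, which is exactly the drawback noted above.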

14 Phrases Idea: Keep globally-frequent phrases contained in a document in a consistent (e.g., lexicographical) order: d12 → < a > < a x > < a x z > < b > < b l >; d37 → < d > < d l > < d l e > < e > < e s >; d42 → < a > < a k > < a y > < d > < d a >. Benefits: Considers only globally-frequent phrases; consistent sort order facilitates merging. Drawbacks: Space-inefficient; requires phrase dictionary with global frequencies freq(p, D) 10/27

15 Frequency-Ordered Phrases Idea: Keep contained globally-frequent phrases in ascending order of their embedded global frequencies freq(p, D): d12 → 5: < x z b > < z b >; 6: < q > < x > < x z >; 7: < z >. d37 → 5: < e s q > < s q x >; 6: < q > < s > < x > < x z >. d42 → 5: < a k q a > < k q a >; 6: < q > < x > < x z >. The interestingness of any unseen phrase is upper-bounded by min(1, |D'| / freq(p, D)), where p is the last phrase encountered; e.g., after reading < s > here, |D'| / freq(< s >, D) = 3/6 11/27

19 Frequency-Ordered Phrases Benefits: Early termination is possible when no unseen phrase can make it into the top-k most interesting phrases; self-contained (i.e., no phrase dictionary needed). Drawback: Space-inefficient 12/27
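
The early-termination benefit can be made concrete. A sketch assuming each document's list holds (global_freq, phrase) pairs in ascending global-frequency order with ties broken by phrase, so that equal pairs group together when the lists are merged:

```python
from heapq import merge, heappush, heappushpop
from itertools import groupby

def top_k_early_termination(doc_lists, k):
    """Scan the merged lists in ascending global frequency; stop once the
    bound min(1, |D'| / freq(p, D)) on any unseen phrase cannot beat the
    current k-th best interestingness score."""
    n = len(doc_lists)                        # |D'|
    heap = []                                 # min-heap of (score, phrase)
    for (gf, p), occurrences in groupby(merge(*doc_lists)):
        score = sum(1 for _ in occurrences) / gf   # local freq / global freq
        if len(heap) < k:
            heappush(heap, (score, p))
        else:
            heappushpop(heap, (score, p))
            if heap[0][0] >= min(1.0, n / gf):     # no unseen phrase can win
                break
    return sorted(heap, reverse=True)
```

With >= in the stop condition, phrases that merely tie the k-th score may be dropped; a production version would also operate on the real on-disk list layout rather than this flattened one.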

20 Prefix-Maximal Phrases Observation: Globally-frequent phrases are often redundant and therefore do not need to be kept explicitly: d12 → < a > < a x > < a x z > < b > < b l >; d37 → < d > < d l > < d l e > < e > < e s >; d42 → < a > < a k > < a y > < d > < d a >. Definition: A phrase p is prefix-maximal in document d if p is globally-frequent and d does not contain a globally-frequent phrase p' of which p is a proper prefix. A prefix-maximal phrase (e.g., < a x z >) can represent all its prefixes (i.e., < a > and < a x >), and it is guaranteed that they are globally-frequent and contained in d 13/27
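
Given the set of globally-frequent phrases contained in a document, the prefix-maximal subset can be computed with a direct (quadratic, illustration-only) filter:

```python
def prefix_maximal(frequent_phrases):
    """Keep only phrases that are not a proper prefix of another
    globally-frequent phrase in the same document; the dropped prefixes
    can be reconstructed on the fly at query time."""
    phrases = set(frequent_phrases)
    return sorted(
        p for p in phrases
        if not any(q[:len(p)] == p for q in phrases if len(q) > len(p))
    )
```

For d12's phrases above, this keeps < a x z > and < b l >, from which < a >, < a x >, and < b > are recoverable.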

22 Prefix-Maximal Phrases Idea: Keep contained prefix-maximal phrases in lexicographical order and extract prefixes on the fly: d12 → < a x z > < b l >; d37 → < d l e > < e s >; d42 → < a k > < a y > < d a >. Benefit: Space-efficient. Drawbacks: Extraction of prefixes entails additional bookkeeping; requires phrase dictionary with global frequencies freq(p, D) 14/27

26 Outline Motivation Problem Statement Our Approach Prior Art Experimental Evaluation Conclusion & Ongoing Research 15/27

27 MCX: Multidimensional Content eXploration Objective: Identify the k most frequent relevant phrases in a given ad-hoc document set D'. Idea: Build an inverted index that maps p → { d | d contains p } and sorts phrases in descending order of freq(p, D): < a > → d12, d2, d89, d122, d7, d128, d876, d4; < b > → d22, d5, d19, d272, d8, d218, d926; < c > → d6, d51, d29, d92, d41, d928; < a i > → d12, d52, d7, d218; < x z k > → d2737, d57; < x z z > → d12 16/27
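
The core MCX counting step amounts to intersecting posting lists with D'. A bare-bones sketch over a toy in-memory index, without the descending-frequency ordering or the randomized approximate intersections:

```python
def mcx_top_k(inverted_index, D_prime, k):
    """Return the k phrases occurring most frequently within the ad-hoc
    set D', by intersecting each phrase's posting list with D'."""
    dp = set(D_prime)
    counts = [(len(dp.intersection(postings)), p)
              for p, postings in inverted_index.items()]
    counts.sort(reverse=True)
    return [p for _, p in counts[:k]]
```

Processing phrases in descending global frequency is what enables MCX's early termination: once the k-th best local count exceeds the global frequency of every remaining phrase, no remaining phrase can qualify.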

35 MCX: Multidimensional Content eXploration Optimization: Approximate fast set intersections through order randomization. Benefits: Early termination is possible when no unseen phrase can make it into the top-k most frequent phrases; performs well for very large ad-hoc document sets D'. Drawback: Considers only frequency as a coarse-grained pre-filter and re-ranks identified frequent phrases based on interestingness. Reference: A. Simitsis, A. Baid, Y. Sismanis, and B. Reinwald. Multidimensional Content eXploration. In VLDB '08 17/27

36 Outline Motivation Problem Statement Our Approach Prior Art Experimental Evaluation Conclusion & Ongoing Research 18/27

37 Experimental Evaluation Dataset: The New York Times Annotated Corpus, consisting of 1.8M newspaper articles published between 1987 and 2007. Queries to determine ad-hoc document sets: person-related (e.g., steve jobs, hillary clinton, ...), news-related (e.g., world trade center, world cup, ...), based on metadata (e.g., /travel/destinations/europe, ...). Implementation in Java; represents data compactly (e.g., variable-length encoding). System: SUN server-class machine (4 CPUs / 16 GB RAM / RAID-5 / Windows 2003 Server) 19/27

38 Examples of Interesting Phrases Query: john lennon 1) since john lennon was assassinated 2) lennon's childhood 3) post-beatles work Query: bob marley 1) music of bob marley 2) marley the jamaican musician 3) i shot the sheriff Query: john mccain 1) to beat al gore like 2) 2000 campaign in arizona 3) the senior senator from virginia 20/27

39 Index Sizes for Different Values of τ [Figure: index size in bytes (up to 1.4e+10) vs. minimum support threshold τ, for Document Content, Phrases, Frequency-Ordered Phrases, Prefix-Maximal Phrases, and MCX; annotated sizes include 1.80 Gb, 4.41 Gb, and 5.64 Gb] 21/27

45 Wall-Clock Times for Different Values of |D'| [Figure: wall-clock time in ms (log scale, up to 1e+06) vs. input cardinality |D'|, for Document Content, Phrases, Frequency-Ordered Phrases, Prefix-Maximal Phrases, and MCX; k = 100, τ = 10; annotated times include 1,030 ms, 3,500 ms, and 14,779 ms] 22/27

51 Wall-Clock Times for Different Values of k [Figure: wall-clock time in ms vs. result size k, for Frequency-Ordered Phrases with and without early termination; τ = 10, |D'| = 500; annotated time: 357 ms] 23/27

55 Outline Motivation Problem Statement Our Approach Prior Art Experimental Evaluation Conclusion & Ongoing Research 24/27

56 Conclusion Efficient identification of interesting phrases on ad-hoc document sets is a challenging research problem. Our methods to tackle this problem are based on forward indexes and differ in how they represent document contents: Frequency-Ordered Phrases allow for early termination; Prefix-Maximal Phrases result in a very compact index. Experiments on a real-world dataset show that our methods work in practice and outperform the state-of-the-art method substantially on realistic inputs 25/27

57 Ongoing Research Alternative scenario: Identify phrases that best distinguish two ad-hoc document sets (e.g., barack obama vs. george bush, or recently published vs. all documents on UEFA). Implement our index structures and pre-computations using Hadoop (or another MapReduce implementation). Support grouping of almost-identical phrases (e.g., george bush said vs. george w. bush said). Make use of NLP techniques such as part-of-speech tagging (e.g., to abstract from concrete pronouns) 26/27

58 Thank you! Questions?

59 Wall-Clock Times for Different Values of τ [Backup slide. Figure: wall-clock time in ms vs. minimum support threshold τ, for Document Content, Phrases, Frequency-Ordered Phrases, and Prefix-Maximal Phrases; k = 100, |D'| = 500; annotated times include 360 ms, 784 ms, 1,125 ms, and 4,075 ms] 28/27


More information

Least Mean Squares Regression. Machine Learning Fall 2017

Least Mean Squares Regression. Machine Learning Fall 2017 Least Mean Squares Regression Machine Learning Fall 2017 1 Lecture Overview Linear classifiers What func?ons do linear classifiers express? Least Squares Method for Regression 2 Where are we? Linear classifiers

More information

An Introduction to Machine Learning

An Introduction to Machine Learning An Introduction to Machine Learning L6: Structured Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January

More information

Geographic Informa0on Retrieval: Are we making progress? Ross Purves, University of Zurich

Geographic Informa0on Retrieval: Are we making progress? Ross Purves, University of Zurich Geographic Informa0on Retrieval: Are we making progress? Ross Purves, University of Zurich Outline Where I m coming from: defini0ons, experiences and requirements for GIR Brief lis>ng of one (of many possible)

More information

WDCloud: An End to End System for Large- Scale Watershed Delineation on Cloud

WDCloud: An End to End System for Large- Scale Watershed Delineation on Cloud WDCloud: An End to End System for Large- Scale Watershed Delineation on Cloud * In Kee Kim, * Jacob Steele, + Anthony Castronova, * Jonathan Goodall, and * Marty Humphrey * University of Virginia + Utah

More information

Behavioral Data Mining. Lecture 2

Behavioral Data Mining. Lecture 2 Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events

More information

From Bases to Exemplars, from Separa2on to Understanding Paris Smaragdis

From Bases to Exemplars, from Separa2on to Understanding Paris Smaragdis From Bases to Exemplars, from Separa2on to Understanding Paris Smaragdis paris@illinois.edu CS & ECE Depts. UNIVERSITY OF ILLINOIS AT URBANA CHAMPAIGN Some Mo2va2on Why do we separate signals? I don t

More information

Par$$oned Elias- Fano Indexes

Par$$oned Elias- Fano Indexes Par$$oned Elias- Fano Indexes Giuseppe O)aviano ISTI- CNR, Pisa Rossano Venturini Università di Pisa Inverted indexes Docid Document 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3

More information

Maintaining Frequent Itemsets over High-Speed Data Streams

Maintaining Frequent Itemsets over High-Speed Data Streams Maintaining Frequent Itemsets over High-Speed Data Streams James Cheng, Yiping Ke, and Wilfred Ng Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Kowloon,

More information

Machine learning for Dynamic Social Network Analysis

Machine learning for Dynamic Social Network Analysis Machine learning for Dynamic Social Network Analysis Manuel Gomez Rodriguez Max Planck Ins7tute for So;ware Systems UC3M, MAY 2017 Interconnected World SOCIAL NETWORKS TRANSPORTATION NETWORKS WORLD WIDE

More information

Markov Chains and Hidden Markov Models

Markov Chains and Hidden Markov Models Markov Chains and Hidden Markov Models CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Slides are based on Klein and Abdeel, CS188, UC Berkeley. Reasoning

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

ArcGIS GeoAnalytics Server: An Introduction. Sarah Ambrose and Ravi Narayanan

ArcGIS GeoAnalytics Server: An Introduction. Sarah Ambrose and Ravi Narayanan ArcGIS GeoAnalytics Server: An Introduction Sarah Ambrose and Ravi Narayanan Overview Introduction Demos Analysis Concepts using GeoAnalytics Server GeoAnalytics Data Sources GeoAnalytics Server Administration

More information

Real- Time Management of Complex Resource Alloca8on Systems: Necessity, Achievements and Further Challenges

Real- Time Management of Complex Resource Alloca8on Systems: Necessity, Achievements and Further Challenges Real- Time Management of Complex Resource Alloca8on Systems: Necessity, Achievements and Further Challenges Spyros Revelio8s School of Industrial & Systems Engineering Georgia Ins8tute of Technology DCDS

More information

The use of GIS tools for analyzing eye- movement data

The use of GIS tools for analyzing eye- movement data The use of GIS tools for analyzing eye- movement data Tomasz Opach Jan Ke>l Rød some facts In recent years an eye- tracking approach has become a common technique for collec>ng data in empirical user studies

More information

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval

More information

A Convolutional Neural Network-based

A Convolutional Neural Network-based A Convolutional Neural Network-based Model for Knowledge Base Completion Dat Quoc Nguyen Joint work with: Dai Quoc Nguyen, Tu Dinh Nguyen and Dinh Phung April 16, 2018 Introduction Word vectors learned

More information

Web Structure Mining Nodes, Links and Influence

Web Structure Mining Nodes, Links and Influence Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.

More information

FREQUENT PATTERN SPACE MAINTENANCE: THEORIES & ALGORITHMS

FREQUENT PATTERN SPACE MAINTENANCE: THEORIES & ALGORITHMS FREQUENT PATTERN SPACE MAINTENANCE: THEORIES & ALGORITHMS FENG MENGLING SCHOOL OF ELECTRICAL & ELECTRONIC ENGINEERING NANYANG TECHNOLOGICAL UNIVERSITY 2009 FREQUENT PATTERN SPACE MAINTENANCE: THEORIES

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

Associa'on Rule Mining

Associa'on Rule Mining Associa'on Rule Mining Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata August 4 and 7, 2014 1 Market Basket Analysis Scenario: customers shopping at a supermarket Transaction

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

Introduction to ArcGIS Server Development

Introduction to ArcGIS Server Development Introduction to ArcGIS Server Development Kevin Deege,, Rob Burke, Kelly Hutchins, and Sathya Prasad ESRI Developer Summit 2008 1 Schedule Introduction to ArcGIS Server Rob and Kevin Questions Break 2:15

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

Introduction of Recruit

Introduction of Recruit Apr. 11, 2018 Introduction of Recruit We provide various kinds of online services from job search to hotel reservations across the world. Housing Beauty Travel Life & Local O2O Education Automobile Bridal

More information

CSE 473: Ar+ficial Intelligence

CSE 473: Ar+ficial Intelligence CSE 473: Ar+ficial Intelligence Hidden Markov Models Luke Ze@lemoyer - University of Washington [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188

More information

CSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule.

CSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule. CSE 473: Ar+ficial Intelligence Markov Models - II Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188

More information

Cri$ques Ø 5 cri&ques in total Ø Each with 6 points

Cri$ques Ø 5 cri&ques in total Ø Each with 6 points Cri$ques Ø 5 cri&ques in total Ø Each with 6 points 1 Distributed Applica$on Alloca$on in Shared Sensor Networks Chengjie Wu, You Xu, Yixin Chen, Chenyang Lu Shared Sensor Network Example in San Francisco

More information

Discovering Spatial and Temporal Links among RDF Data Panayiotis Smeros and Manolis Koubarakis

Discovering Spatial and Temporal Links among RDF Data Panayiotis Smeros and Manolis Koubarakis Discovering Spatial and Temporal Links among RDF Data Panayiotis Smeros and Manolis Koubarakis WWW2016 Workshop: Linked Data on the Web (LDOW2016) April 12, 2016 - Montréal, Canada Outline Introduction

More information

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University Association Rules CS 5331 by Rattikorn Hewett Texas Tech University 1 Acknowledgements Some parts of these slides are modified from n C. Clifton & W. Aref, Purdue University 2 1 Outline n Association Rule

More information

Internal link prediction: a new approach for predicting links in bipartite graphs

Internal link prediction: a new approach for predicting links in bipartite graphs Internal link prediction: a new approach for predicting links in bipartite graphs Oussama llali, lémence Magnien and Matthieu Latapy LIP6 NRS and Université Pierre et Marie urie (UPM Paris 6) 4 place Jussieu

More information

December 3, Dipartimento di Informatica, Università di Torino. Felicittà. Visualizing and Estimating Happiness in

December 3, Dipartimento di Informatica, Università di Torino. Felicittà. Visualizing and Estimating Happiness in : : Dipartimento di Informatica, Università di Torino December 3, 2013 : Outline : from social media Social media users generated contents represent the human behavior in everyday life, but... how to analyze

More information

Removing trivial associations in association rule discovery

Removing trivial associations in association rule discovery Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic

More information

Estimation of the Click Volume by Large Scale Regression Analysis May 15, / 50

Estimation of the Click Volume by Large Scale Regression Analysis May 15, / 50 Estimation of the Click Volume by Large Scale Regression Analysis Yuri Lifshits, Dirk Nowotka [International Computer Science Symposium in Russia 07, Ekaterinburg] Presented by: Aditya Menon UCSD May 15,

More information

Sequential Feature Explanations for Anomaly Detection

Sequential Feature Explanations for Anomaly Detection Sequential Feature Explanations for Anomaly Detection Md Amran Siddiqui, Alan Fern, Thomas G. Die8erich and Weng-Keen Wong School of EECS Oregon State University Anomaly Detection Anomalies : points that

More information

New Directions in Computer Science

New Directions in Computer Science New Directions in Computer Science John Hopcroft Cornell University Time of change The information age is a revolution that is changing all aspects of our lives. Those individuals, institutions, and nations

More information

Sta$s$cal Significance Tes$ng In Theory and In Prac$ce

Sta$s$cal Significance Tes$ng In Theory and In Prac$ce Sta$s$cal Significance Tes$ng In Theory and In Prac$ce Ben Cartere8e University of Delaware h8p://ir.cis.udel.edu/ictir13tutorial Hypotheses and Experiments Hypothesis: Using an SVM for classifica$on will

More information

Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data

Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data Nikita Zhiltsov 1,2 Alexander Kotov 3 Fedor Nikolaev 3 1 Kazan Federal University 2 Textocat 3 Textual Data Analytics

More information

A Probabilistic Model for Canonicalizing Named Entity Mentions. Dani Yogatama Yanchuan Sim Noah A. Smith

A Probabilistic Model for Canonicalizing Named Entity Mentions. Dani Yogatama Yanchuan Sim Noah A. Smith A Probabilistic Model for Canonicalizing Named Entity Mentions Dani Yogatama Yanchuan Sim Noah A. Smith Introduction Model Experiments Conclusions Outline Introduction Model Experiments Conclusions Outline

More information

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1 Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)

More information

Geodatabase An Overview

Geodatabase An Overview Federal GIS Conference February 9 10, 2015 Washington, DC Geodatabase An Overview Ralph Denkenberger - esri Session Path The Geodatabase - What is it? - Why use it? - What types are there? Inside the Geodatabase

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar NATURAL LANGUAGE PROCESSING Dr. G. Bharadwaja Kumar Sentence Boundary Markers Many natural language processing (NLP) systems generally take a sentence as an input unit part of speech (POS) tagging, chunking,

More information

Mobility Analytics through Social and Personal Data. Pierre Senellart

Mobility Analytics through Social and Personal Data. Pierre Senellart Mobility Analytics through Social and Personal Data Pierre Senellart Session: Big Data & Transport Business Convention on Big Data Université Paris-Saclay, 25 novembre 2015 Analyzing Transportation and

More information

Objec&ves. Review. Data structure: Heaps Data structure: Graphs. What is a priority queue? What is a heap?

Objec&ves. Review. Data structure: Heaps Data structure: Graphs. What is a priority queue? What is a heap? Objec&ves Data structure: Heaps Data structure: Graphs Submit problem set 2 1 Review What is a priority queue? What is a heap? Ø Proper&es Ø Implementa&on What is the process for finding the smallest element

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08

More information

LBT for Procedural and Reac1ve Systems Part 2: Reac1ve Systems Basic Theory

LBT for Procedural and Reac1ve Systems Part 2: Reac1ve Systems Basic Theory LBT for Procedural and Reac1ve Systems Part 2: Reac1ve Systems Basic Theory Karl Meinke, karlm@kth.se School of Computer Science and Communica:on KTH Stockholm 0. Overview of the Lecture 1. Learning Based

More information

Computer Vision. Pa0ern Recogni4on Concepts Part I. Luis F. Teixeira MAP- i 2012/13

Computer Vision. Pa0ern Recogni4on Concepts Part I. Luis F. Teixeira MAP- i 2012/13 Computer Vision Pa0ern Recogni4on Concepts Part I Luis F. Teixeira MAP- i 2012/13 What is it? Pa0ern Recogni4on Many defini4ons in the literature The assignment of a physical object or event to one of

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Stochastic Grammars Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(22) Structured Classification

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Graphical Models. Lecture 1: Mo4va4on and Founda4ons. Andrew McCallum

Graphical Models. Lecture 1: Mo4va4on and Founda4ons. Andrew McCallum Graphical Models Lecture 1: Mo4va4on and Founda4ons Andrew McCallum mccallum@cs.umass.edu Thanks to Noah Smith and Carlos Guestrin for some slide materials. Board work Expert systems the desire for probability

More information

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set

More information

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16 Logis@cs CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Sign

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra

More information

DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery

DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery Yingjie Hu, Grant McKenzie, Jiue-An Yang, Song Gao, Amin Abdalla, and

More information

STATISTICAL PERFORMANCE

STATISTICAL PERFORMANCE STATISTICAL PERFORMANCE PROVISIONING AND ENERGY EFFICIENCY IN DISTRIBUTED COMPUTING SYSTEMS Nikzad Babaii Rizvandi 1 Supervisors: Prof.Albert Y.Zomaya Prof. Aruna Seneviratne OUTLINE Introduction Background

More information

Project 2: Hadoop PageRank Cloud Computing Spring 2017

Project 2: Hadoop PageRank Cloud Computing Spring 2017 Project 2: Hadoop PageRank Cloud Computing Spring 2017 Professor Judy Qiu Goal This assignment provides an illustration of PageRank algorithms and Hadoop. You will then blend these applications by implementing

More information

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion : An Efficient Matching-Based Method for Error-Tolerant Autocompletion Dong Deng Guoliang Li He Wen H. V. Jagadish Jianhua Feng Department of Computer Science, Tsinghua National Laboratory for Information

More information

MobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL

MobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL MobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL Jing (Selena) He Department of Computer Science, Kennesaw State University Shouling Ji,

More information

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Minimum Edit Distance. Defini'on of Minimum Edit Distance Minimum Edit Distance Defini'on of Minimum Edit Distance How similar are two strings? Spell correc'on The user typed graffe Which is closest? graf gra@ grail giraffe Computa'onal Biology Align two sequences

More information

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Recent Applica6ons of Lasso

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Recent Applica6ons of Lasso Machine Learning Data Mining CS/CNS/EE 155 Lecture 4: Recent Applica6ons of Lasso 1 Today: Two Recent Applica6ons Cancer Detec0on Personaliza0on via twi9er music( Biden( soccer( Labour( Applica6ons of

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information