Interesting-Phrase Mining for Ad-Hoc Text Analytics
Interesting-Phrase Mining for Ad-Hoc Text Analytics. Klaus Berberich (inf.mpg.de). Joint work with: Srikanta Bedathur, Jens Dittrich, Nikos Mamoulis, and Gerhard Weikum. (Originally presented at VLDB 2010.)
Motivation. Continuously growing wealth of unstructured text data, e.g.: e-mail and instant messaging; social media content (e.g., twitter and facebook); customer reports and product reviews; the general Web (e.g., news portals). Search on this data is well understood and under control (e.g., find all news articles that mention barack obama). Gaining insights is tedious and a task mostly left to users (e.g., identify characteristic quotations by barack obama).
Motivation. Tools available today include tag clouds and search-result snippets. Limitations: they focus on single terms (no entity names, quotations, etc.); they are frequency-based (not necessarily interesting, e.g., stopwords); they perform global analysis (no ad-hoc document sets, e.g., determined by a query). Our idea: identify interesting phrases that distinguish an ad-hoc document set from the document collection as a whole.
Outline: Motivation, Problem Statement, Our Approach, Prior Art, Experimental Evaluation, Conclusion & Ongoing Research.
Model & Problem Statement. Document collection D. Documents are sequences of terms (e.g., d12 = <a x z b l k a q x>). Phrases are sequences of terms (e.g., <a x z>, <a x z b>, <z b l k>, ...). Input: an ad-hoc document set D' ⊆ D (e.g., all documents that contain barack obama). Output: the k most interesting phrases in D', as a ranked list (e.g., 1. <z x z>, 2. <e s q x>, ..., k. <d l e s q>).
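As a small illustration of this model (assuming, as the examples above suggest, that phrases are contiguous subsequences), enumerating every phrase of a document up to a maximum length m can be sketched as:

```python
def enumerate_phrases(doc, m=5):
    """All contiguous phrases of doc with length <= m, as tuples of terms."""
    return [tuple(doc[i:j])
            for i in range(len(doc))
            for j in range(i + 1, min(i + m, len(doc)) + 1)]

# Example document d12 from the slides
d12 = ["a", "x", "z", "b", "l", "k", "a", "q", "x"]
phrases = enumerate_phrases(d12, m=3)  # contains e.g. ("a", "x", "z")
```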
When Do We Consider a Phrase Interesting? For a phrase p we distinguish its local frequency freq(p, D') in the ad-hoc document set D' from its global frequency freq(p, D) in the document collection D. The interestingness of phrase p is defined as I(p, D') = freq(p, D') / freq(p, D). We consider only those phrases as relevant that are globally frequent, i.e., occur in at least τ documents from D, and have length at most m (e.g., 5 in our experiments). Note: our methods adapt to other definitions of interestingness (e.g., based on log-likelihood ratios or PMI).
Outline: Motivation, Problem Statement, Our Approach, Prior Art, Experimental Evaluation, Conclusion & Ongoing Research.
Overview. We precompute the global frequency freq(p, D) for every relevant phrase p using an apriori-style algorithm and keep a phrase dictionary that maps p → freq(p, D). Our methods rely on forward indexes that map each document to a representation of its content (e.g., d12 → representation of d12's content, d37 → representation of d37's content, ...). For a given ad-hoc document set D', our methods access the forward index for each d ∈ D', merge the |D'| content representations, and output the k most interesting phrases.
Document Content. Idea: keep document content explicitly as a sequence of contained terms (or term identifiers), e.g., d12 → <a x z b l k a q x>, d37 → <z x z d l e s q x>, d42 → <k x z d a k q a y>. Benefit: space-efficient. Drawbacks: enumeration of phrases is needed (including globally-infrequent ones); requires the phrase dictionary with global frequencies freq(p, D).
Phrases. Idea: keep the globally-frequent phrases contained in a document in a consistent (e.g., lexicographical) order, e.g., d12 → <a> <a x> <a x z> <b> <b l> ..., d37 → <d> <d l> <d l e> <e> <e s> ..., d42 → <a> <a k> <a y> <d> <d a> ... Benefits: considers only globally-frequent phrases; the consistent sort order facilitates merging. Drawbacks: space-inefficient; requires the phrase dictionary with global frequencies freq(p, D).
Frequency-Ordered Phrases. Idea: keep the contained globally-frequent phrases in ascending order of their embedded global frequencies freq(p, D), e.g., d12 → 5: <x z b> <z b>, 6: <q> <x> <x z>, 7: <z>; d37 → 5: <e s q> <s q x>, 6: <q> <s> <x> <x z>; d42 → 5: <a k q a> <k q a>, 6: <q> <x> <x z>. The interestingness of any unseen phrase is upper-bounded by min(1, |D'| / freq(p, D)), where p is the last phrase encountered (e.g., |D'| / freq(<s>, D) = 3/6).
Frequency-Ordered Phrases. Benefits: early termination is possible when no unseen phrase can make it into the top-k most interesting phrases; self-contained (i.e., no phrase dictionary is needed). Drawback: space-inefficient.
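A hedged sketch of the early-termination idea; the data layout below, one list of (global frequency, phrase) pairs per document in D', is an assumption for illustration, not the paper's actual on-disk format:

```python
import heapq

def top_k_early_termination(doc_lists, k):
    """doc_lists: one list per document in D'; each holds (global_freq, phrase)
    pairs sorted ascending by global frequency (ties broken by phrase)."""
    n = len(doc_lists)                     # |D'|
    best = []                              # min-heap of (score, phrase)
    cur, cur_gf, cur_cnt = None, None, 0

    def finalize():
        """Score the phrase whose occurrences we just finished merging."""
        if cur is None:
            return
        score = min(1.0, cur_cnt / cur_gf)
        if len(best) < k:
            heapq.heappush(best, (score, cur))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, cur))

    for gf, phrase in heapq.merge(*doc_lists):
        if phrase != cur:
            finalize()
            # Early termination: any unseen phrase has global frequency >= gf,
            # hence interestingness at most min(1, |D'| / gf).
            if len(best) == k and best[0][0] >= min(1.0, n / gf):
                return sorted(best, reverse=True)
            cur, cur_gf, cur_cnt = phrase, gf, 1
        else:
            cur_cnt += 1
    finalize()
    return sorted(best, reverse=True)
```

Because every copy of a phrase carries the same global frequency, `heapq.merge` brings all its occurrences together, so a phrase's local frequency is complete before it is scored.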
Prefix-Maximal Phrases. Observation: globally-frequent phrases are often redundant and therefore do not need to be kept explicitly. Definition: a phrase p is prefix-maximal in document d if p is globally frequent and d does not contain a globally-frequent phrase p' of which p is a proper prefix. A prefix-maximal phrase (e.g., <a x z>) can represent all its prefixes (i.e., <a> and <a x>), and it is guaranteed that they are globally frequent and contained in d.
Prefix-Maximal Phrases. Idea: keep the contained prefix-maximal phrases in lexicographical order and extract their prefixes on-the-fly, e.g., d12 → <a x z> <b l>, d37 → <d l e> <e s>, d42 → <a k> <a y> <d a>, from which prefixes such as <a>, <a x>, <d>, and <d l> are recovered during merging. Benefit: space-efficient. Drawbacks: extraction of prefixes entails additional bookkeeping; requires the phrase dictionary with global frequencies freq(p, D).
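Extracting the implied prefixes can be sketched as follows; this is a minimal illustration in which a set stands in for the additional bookkeeping the slide mentions:

```python
def expand_prefix_maximal(pm_phrases):
    """Recover all globally-frequent phrases of a document from its
    prefix-maximal phrases: every prefix of a prefix-maximal phrase is
    guaranteed to be globally frequent and contained in the document."""
    phrases = set()
    for p in pm_phrases:                  # stored in lexicographical order
        for length in range(1, len(p) + 1):
            phrases.add(p[:length])       # add each prefix, deduplicated
    return phrases

# Example: the prefix-maximal phrases of d12 from the slides
expanded = expand_prefix_maximal([("a", "x", "z"), ("b", "l")])
```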
Outline: Motivation, Problem Statement, Our Approach, Prior Art, Experimental Evaluation, Conclusion & Ongoing Research.
MCX: Multidimensional Content eXploration. Objective: identify the k most frequent relevant phrases in a given ad-hoc document set D'. Idea: build an inverted index that maps p → { d | d contains p } and sorts phrases in descending order of freq(p, D); the document list of D' (e.g., d12, d52, d7, ...) is then intersected with each phrase's posting list in turn, e.g.: <a> → d12, d2, d89, d122, d7, d128, d876, d4; <b> → d22, d5, d19, d272, d8, d218, d926; <c> → d6, d51, d29, d92, d41, d928; <a i> → d12, d52, d7, d218; <x z k> → d2737, d57; <x z z> → d12.
MCX: Multidimensional Content eXploration. Optimization: approximate fast set intersections through order randomization. Benefits: early termination is possible when no unseen phrase can make it into the top-k most frequent phrases; performs well for very large ad-hoc document sets D'. Drawback: considers only frequency as a coarse-grained pre-filter; re-ranks the identified frequent phrases based on interestingness. Reference: A. Simitsis, A. Baid, Y. Sismanis, and B. Reinwald. Multidimensional Content eXploration. In VLDB '08.
Outline: Motivation, Problem Statement, Our Approach, Prior Art, Experimental Evaluation, Conclusion & Ongoing Research.
Experimental Evaluation. Dataset: The New York Times Annotated Corpus, consisting of 1.8M newspaper articles published between 1987 and 2007. Queries to determine ad-hoc document sets: person-related (e.g., steve jobs, hillary clinton, ...), news-related (e.g., world trade center, world cup, ...), and based on metadata (e.g., /travel/destinations/europe, ...). Implementation in Java, representing data compactly (e.g., variable-length encoding). System: SUN server-class machine (4 CPUs / 16 GB RAM / RAID-5 / Windows 2003 Server).
Examples of Interesting Phrases. Query: john lennon → 1) since john lennon was assassinated, 2) lennon's childhood, 3) post-beatles work. Query: bob marley → 1) music of bob marley, 2) marley the jamaican musician, 3) i shot the sheriff. Query: john mccain → 1) to beat al gore like, 2) 2000 campaign in arizona, 3) the senior senator from virginia.
Index Sizes for Different Values of τ. [Figure: index size (bytes) vs. minimum support threshold τ, comparing Document Content, Phrases, Frequency-Ordered Phrases, Prefix-Maximal Phrases, and MCX; reported sizes include 5.64 GB, 4.41 GB, and 1.80 GB.]
Wall-Clock Times for Different Values of |D'|. [Figure: wall-clock time (ms) vs. input cardinality |D'|, comparing Document Content, Phrases, Frequency-Ordered Phrases, Prefix-Maximal Phrases, and MCX, with k = 100 and τ = 10; reported times include 1,030 ms, 3,500 ms, 14,779 ms, and 21,575 ms.]
Wall-Clock Times for Different Values of k. [Figure: wall-clock time (ms) vs. result size k, comparing Frequency-Ordered Phrases with and without early termination, with τ = 10 and |D'| = 500; one reported time is 357 ms.]
Outline: Motivation, Problem Statement, Our Approach, Prior Art, Experimental Evaluation, Conclusion & Ongoing Research.
Conclusion. Efficient identification of interesting phrases on ad-hoc document sets is a challenging research problem. Our methods to tackle this problem are based on forward indexes and differ in how they represent document contents: Frequency-Ordered Phrases allow for early termination; Prefix-Maximal Phrases result in a very compact index. Experiments on a real-world dataset show that our methods work in practice and outperform the state-of-the-art method substantially on realistic inputs.
Ongoing Research. Alternative scenario: identify phrases that best distinguish two ad-hoc document sets (e.g., barack obama vs. george bush, or recently published vs. all documents on UEFA). Implement our index structures and pre-computations using Hadoop (or another MapReduce implementation). Support grouping of almost-identical phrases (e.g., george bush said vs. george w. bush said). Make use of NLP techniques such as part-of-speech tagging (e.g., to abstract from concrete pronouns).
Thank you! Questions?
Wall-Clock Times for Different Values of τ. [Backup figure: wall-clock time (ms) vs. minimum support threshold τ, comparing Document Content, Phrases, Frequency-Ordered Phrases, and Prefix-Maximal Phrases, with k = 100 and |D'| = 500; reported times include 360 ms, 784 ms, 1,125 ms, and 4,075 ms.]
Machine learning for Dynamic Social Network Analysis Manuel Gomez Rodriguez Max Planck Ins7tute for So;ware Systems UC3M, MAY 2017 Interconnected World SOCIAL NETWORKS TRANSPORTATION NETWORKS WORLD WIDE
More informationMarkov Chains and Hidden Markov Models
Markov Chains and Hidden Markov Models CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Slides are based on Klein and Abdeel, CS188, UC Berkeley. Reasoning
More informationCollaborative topic models: motivations cont
Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.
More informationArcGIS GeoAnalytics Server: An Introduction. Sarah Ambrose and Ravi Narayanan
ArcGIS GeoAnalytics Server: An Introduction Sarah Ambrose and Ravi Narayanan Overview Introduction Demos Analysis Concepts using GeoAnalytics Server GeoAnalytics Data Sources GeoAnalytics Server Administration
More informationReal- Time Management of Complex Resource Alloca8on Systems: Necessity, Achievements and Further Challenges
Real- Time Management of Complex Resource Alloca8on Systems: Necessity, Achievements and Further Challenges Spyros Revelio8s School of Industrial & Systems Engineering Georgia Ins8tute of Technology DCDS
More informationThe use of GIS tools for analyzing eye- movement data
The use of GIS tools for analyzing eye- movement data Tomasz Opach Jan Ke>l Rød some facts In recent years an eye- tracking approach has become a common technique for collec>ng data in empirical user studies
More informationBoolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval
More informationA Convolutional Neural Network-based
A Convolutional Neural Network-based Model for Knowledge Base Completion Dat Quoc Nguyen Joint work with: Dai Quoc Nguyen, Tu Dinh Nguyen and Dinh Phung April 16, 2018 Introduction Word vectors learned
More informationWeb Structure Mining Nodes, Links and Influence
Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.
More informationFREQUENT PATTERN SPACE MAINTENANCE: THEORIES & ALGORITHMS
FREQUENT PATTERN SPACE MAINTENANCE: THEORIES & ALGORITHMS FENG MENGLING SCHOOL OF ELECTRICAL & ELECTRONIC ENGINEERING NANYANG TECHNOLOGICAL UNIVERSITY 2009 FREQUENT PATTERN SPACE MAINTENANCE: THEORIES
More informationLatent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language
More informationAssocia'on Rule Mining
Associa'on Rule Mining Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata August 4 and 7, 2014 1 Market Basket Analysis Scenario: customers shopping at a supermarket Transaction
More informationLarge-Scale Behavioral Targeting
Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting
More informationIntroduction to ArcGIS Server Development
Introduction to ArcGIS Server Development Kevin Deege,, Rob Burke, Kelly Hutchins, and Sathya Prasad ESRI Developer Summit 2008 1 Schedule Introduction to ArcGIS Server Rob and Kevin Questions Break 2:15
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More informationIntroduction of Recruit
Apr. 11, 2018 Introduction of Recruit We provide various kinds of online services from job search to hotel reservations across the world. Housing Beauty Travel Life & Local O2O Education Automobile Bridal
More informationCSE 473: Ar+ficial Intelligence
CSE 473: Ar+ficial Intelligence Hidden Markov Models Luke Ze@lemoyer - University of Washington [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188
More informationCSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule.
CSE 473: Ar+ficial Intelligence Markov Models - II Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188
More informationCri$ques Ø 5 cri&ques in total Ø Each with 6 points
Cri$ques Ø 5 cri&ques in total Ø Each with 6 points 1 Distributed Applica$on Alloca$on in Shared Sensor Networks Chengjie Wu, You Xu, Yixin Chen, Chenyang Lu Shared Sensor Network Example in San Francisco
More informationDiscovering Spatial and Temporal Links among RDF Data Panayiotis Smeros and Manolis Koubarakis
Discovering Spatial and Temporal Links among RDF Data Panayiotis Smeros and Manolis Koubarakis WWW2016 Workshop: Linked Data on the Web (LDOW2016) April 12, 2016 - Montréal, Canada Outline Introduction
More informationAssociation Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University
Association Rules CS 5331 by Rattikorn Hewett Texas Tech University 1 Acknowledgements Some parts of these slides are modified from n C. Clifton & W. Aref, Purdue University 2 1 Outline n Association Rule
More informationInternal link prediction: a new approach for predicting links in bipartite graphs
Internal link prediction: a new approach for predicting links in bipartite graphs Oussama llali, lémence Magnien and Matthieu Latapy LIP6 NRS and Université Pierre et Marie urie (UPM Paris 6) 4 place Jussieu
More informationDecember 3, Dipartimento di Informatica, Università di Torino. Felicittà. Visualizing and Estimating Happiness in
: : Dipartimento di Informatica, Università di Torino December 3, 2013 : Outline : from social media Social media users generated contents represent the human behavior in everyday life, but... how to analyze
More informationRemoving trivial associations in association rule discovery
Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationRETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic
More informationEstimation of the Click Volume by Large Scale Regression Analysis May 15, / 50
Estimation of the Click Volume by Large Scale Regression Analysis Yuri Lifshits, Dirk Nowotka [International Computer Science Symposium in Russia 07, Ekaterinburg] Presented by: Aditya Menon UCSD May 15,
More informationSequential Feature Explanations for Anomaly Detection
Sequential Feature Explanations for Anomaly Detection Md Amran Siddiqui, Alan Fern, Thomas G. Die8erich and Weng-Keen Wong School of EECS Oregon State University Anomaly Detection Anomalies : points that
More informationNew Directions in Computer Science
New Directions in Computer Science John Hopcroft Cornell University Time of change The information age is a revolution that is changing all aspects of our lives. Those individuals, institutions, and nations
More informationSta$s$cal Significance Tes$ng In Theory and In Prac$ce
Sta$s$cal Significance Tes$ng In Theory and In Prac$ce Ben Cartere8e University of Delaware h8p://ir.cis.udel.edu/ictir13tutorial Hypotheses and Experiments Hypothesis: Using an SVM for classifica$on will
More informationFielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data
Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data Nikita Zhiltsov 1,2 Alexander Kotov 3 Fedor Nikolaev 3 1 Kazan Federal University 2 Textocat 3 Textual Data Analytics
More informationA Probabilistic Model for Canonicalizing Named Entity Mentions. Dani Yogatama Yanchuan Sim Noah A. Smith
A Probabilistic Model for Canonicalizing Named Entity Mentions Dani Yogatama Yanchuan Sim Noah A. Smith Introduction Model Experiments Conclusions Outline Introduction Model Experiments Conclusions Outline
More informationAlgorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview
Algorithmic Methods of Data Mining, Fall 2005, Course overview 1 Course overview lgorithmic Methods of Data Mining, Fall 2005, Course overview 1 T-61.5060 Algorithmic methods of data mining (3 cp) P T-61.5060
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationRetrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1
Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)
More informationGeodatabase An Overview
Federal GIS Conference February 9 10, 2015 Washington, DC Geodatabase An Overview Ralph Denkenberger - esri Session Path The Geodatabase - What is it? - Why use it? - What types are there? Inside the Geodatabase
More informationData Analytics Beyond OLAP. Prof. Yanlei Diao
Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of
More informationNATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar
NATURAL LANGUAGE PROCESSING Dr. G. Bharadwaja Kumar Sentence Boundary Markers Many natural language processing (NLP) systems generally take a sentence as an input unit part of speech (POS) tagging, chunking,
More informationMobility Analytics through Social and Personal Data. Pierre Senellart
Mobility Analytics through Social and Personal Data Pierre Senellart Session: Big Data & Transport Business Convention on Big Data Université Paris-Saclay, 25 novembre 2015 Analyzing Transportation and
More informationObjec&ves. Review. Data structure: Heaps Data structure: Graphs. What is a priority queue? What is a heap?
Objec&ves Data structure: Heaps Data structure: Graphs Submit problem set 2 1 Review What is a priority queue? What is a heap? Ø Proper&es Ø Implementa&on What is the process for finding the smallest element
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08
More informationLBT for Procedural and Reac1ve Systems Part 2: Reac1ve Systems Basic Theory
LBT for Procedural and Reac1ve Systems Part 2: Reac1ve Systems Basic Theory Karl Meinke, karlm@kth.se School of Computer Science and Communica:on KTH Stockholm 0. Overview of the Lecture 1. Learning Based
More informationComputer Vision. Pa0ern Recogni4on Concepts Part I. Luis F. Teixeira MAP- i 2012/13
Computer Vision Pa0ern Recogni4on Concepts Part I Luis F. Teixeira MAP- i 2012/13 What is it? Pa0ern Recogni4on Many defini4ons in the literature The assignment of a physical object or event to one of
More informationStatistical Methods for NLP
Statistical Methods for NLP Stochastic Grammars Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(22) Structured Classification
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationGraphical Models. Lecture 1: Mo4va4on and Founda4ons. Andrew McCallum
Graphical Models Lecture 1: Mo4va4on and Founda4ons Andrew McCallum mccallum@cs.umass.edu Thanks to Noah Smith and Carlos Guestrin for some slide materials. Board work Expert systems the desire for probability
More informationVector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model
Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set
More informationCS 6140: Machine Learning Spring What We Learned Last Week 2/26/16
Logis@cs CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Sign
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra
More informationDEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery
DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery Yingjie Hu, Grant McKenzie, Jiue-An Yang, Song Gao, Amin Abdalla, and
More informationSTATISTICAL PERFORMANCE
STATISTICAL PERFORMANCE PROVISIONING AND ENERGY EFFICIENCY IN DISTRIBUTED COMPUTING SYSTEMS Nikzad Babaii Rizvandi 1 Supervisors: Prof.Albert Y.Zomaya Prof. Aruna Seneviratne OUTLINE Introduction Background
More informationProject 2: Hadoop PageRank Cloud Computing Spring 2017
Project 2: Hadoop PageRank Cloud Computing Spring 2017 Professor Judy Qiu Goal This assignment provides an illustration of PageRank algorithms and Hadoop. You will then blend these applications by implementing
More informationMETA: An Efficient Matching-Based Method for Error-Tolerant Autocompletion
: An Efficient Matching-Based Method for Error-Tolerant Autocompletion Dong Deng Guoliang Li He Wen H. V. Jagadish Jianhua Feng Department of Computer Science, Tsinghua National Laboratory for Information
More informationMobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL
MobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL Jing (Selena) He Department of Computer Science, Kennesaw State University Shouling Ji,
More informationMinimum Edit Distance. Defini'on of Minimum Edit Distance
Minimum Edit Distance Defini'on of Minimum Edit Distance How similar are two strings? Spell correc'on The user typed graffe Which is closest? graf gra@ grail giraffe Computa'onal Biology Align two sequences
More informationMachine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Recent Applica6ons of Lasso
Machine Learning Data Mining CS/CNS/EE 155 Lecture 4: Recent Applica6ons of Lasso 1 Today: Two Recent Applica6ons Cancer Detec0on Personaliza0on via twi9er music( Biden( soccer( Labour( Applica6ons of
More informationReductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York
Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association
More informationSparse and Robust Optimization and Applications
Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability
More information