Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

1 Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

2 References Hashing Techniques: Elmasri, 7th Ed. Chapter 16, section 8. Cormen, 3rd Ed. Chapter 11. Inverted indexing: Elmasri, 7th Ed. Chapter 27, section 5. 2

3 Hashing Techniques (1) Highly efficient indexes for equality search. They can be used for both internal and external files. We can classify the techniques into two types: Static Hashing: for fixed-size, non-mutable data (e.g. single-session CD-ROM/DVD/Blu-ray). Hashing for Dynamic File Expansion: when both the data and the data size can vary over time. As with tree indexes for secondary memory, hash indexes use blocks to store buckets. 3

4 Hashing Techniques (2) Information can be accessed fast: hashing is another type of primary file organization. Uses: value (equality) search, bucketing (see hash join). We want to examine an arbitrary position in an array in O(1) time; for some hash data structures the worst-case seek time is Θ(n). Hashing allows implementing the dictionary operations (Insert, Search, Delete). 4

5 Hashing Functions We want to store our records in a given number of blocks (and buckets). The search condition involves a (single) hash field; in most cases this field is a key field of the file, in which case it is called the hash key. We define a hash function mapping each value into the range 0..M-1, e.g. h(k) = k mod M. If k is non-numeric, its byte representation is used instead. 5
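As a minimal illustration of the slide above (not part of the original material), a possible hash function in Python, assuming M buckets and falling back to the key's byte representation for non-numeric keys:

```python
# Minimal sketch of a hash function for a hash file with M buckets.
# Numeric keys use k mod M; non-numeric keys hash their byte representation.
def h(key, M):
    if isinstance(key, int):
        return key % M
    # Fall back to the key's byte representation (assumption: UTF-8 encoding).
    return sum(str(key).encode("utf-8")) % M

print(h(1100, 4))      # 0
print(h("Gregor", 4))  # some bucket index in range 0..M-1
```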

6 Collision Resolution Some collisions may happen (there exist k ≠ k′ such that h(k) = h(k′)) and they have to be resolved (collision resolution). The main collision resolution strategies are: Chaining: the bucket array is extended with overflow positions. Open addressing: starting from the occupied position, we seek the next free position. Multiple hashing: if one hash function generates a collision, we try the next function. We are going to see only the first technique. 6
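A minimal Python sketch of chaining, the only technique covered here: each bucket holds a chain of records and colliding keys are simply appended to the same chain. The table size M and the record format are illustrative assumptions.

```python
# Collision resolution by chaining: each of the M buckets holds a list of
# (key, value) records; keys that hash to the same bucket share the chain.
class ChainedHashTable:
    def __init__(self, M=4):
        self.M = M
        self.buckets = [[] for _ in range(M)]

    def _h(self, k):
        return k % self.M

    def insert(self, k, value):
        self.buckets[self._h(k)].append((k, value))

    def search(self, k):
        for key, value in self.buckets[self._h(k)]:
            if key == k:
                return value
        return None

    def delete(self, k):
        bucket = self.buckets[self._h(k)]
        self.buckets[self._h(k)] = [(key, v) for key, v in bucket if key != k]

t = ChainedHashTable(M=4)
t.insert(0b1100, "r1")   # 12 mod 4 = 0
t.insert(0b0100, "r2")   # 4 mod 4 = 0 -> collides, chained in the same bucket
print(t.search(0b1100))  # r1
```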

7 Static Hashing A fixed number of buckets M is allocated, together with a hash function mapping K to [0, ..., M-1]. Each bucket has at least one block; each block is composed of m records. E.g., H(K) returns the first/last i bits of K's byte representation; in this example i = 1, so there are 2^i = 2 buckets, and each bucket contains one block. [Figure: record blocks and buckets] 7

8 Static Hashing for Internal Files 8

9 Static Hashing for Internal Files 9

10 Static Hashing: searching 1100 The hash function computes the array index where the records with a given hash field are stored 1100 H Hash Function 10

11 Static Hashing: Search(1100) The hash function computes the array index where the records with a given hash field are stored 1100 H Hash Function 11

12 Static Hashing: Insert() We use the chaining method to extend a block with overflow positions. H

13 Static Hashing: Insert() We use the chaining method to extend a block with overflow positions. H

14 Static Hashing: Insert() We use the chaining method to extend a block with overflow positions. H

15 How many blocks are there? A 0 B 1 4 C 2 D

16 How many buckets are there? A 0 B 1 4 C 2 D

17 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows. H

18 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows. H

19 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows. H

20 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows.

21 Static Hashing: Efficiency(?) When no overflow occurs, a bucket can be reached with a single access; then at most m records have to be scanned to find a specific record. Overflows cause a quick performance degradation. Efficiency depends on: the index size vs. data size ratio (number of buckets), and uniform hashing (e.g. H(k) must not be a constant function). H(K) can vary at run time in order to reduce the number of overflows. 21

22 Dynamic File Expansion Collision overflows can be avoided by changing the number of buckets dynamically. Extendible Hashing: buckets are accessed through an extendible directory storing pointers to buckets; overflows are not used; the number of indexed buckets grows exponentially. Linear Hashing: no directory is used; overflows are still used; the number of indexed buckets grows linearly. Dynamic Hashing: a precursor of extendible hashing; this topic will not be covered here. 22

23 External Hashing: a general example 23

24 Extendible Hashing (1) Uses a level of indirection through an array of pointers to buckets (the directory). The directory grows by doubling its size: it has size 2^d, where d (the global depth) grows by one at each doubling. Each entry in the directory points to a single bucket, but each bucket can be reached from multiple entries. Each bucket stores a variable d′ (local depth) that indicates how many bits are actually used to index it. 24

25 Extendible Hashing (2) The hash function changes together with d: it returns the d most significant bits of the binary-coded search key. Extendible hashing doesn't use overflow blocks, as the data structure is dynamically updated. 25
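A small sketch (not from the slides) of how the directory index can be derived from the d most significant bits, assuming fixed-width 4-bit keys as in the figures:

```python
# Sketch: with global depth d, the directory index is the d most significant
# bits of the fixed-width binary key. KEY_BITS = 4 is an assumption for the demo.
KEY_BITS = 4

def directory_index(key, d):
    return key >> (KEY_BITS - d)   # top d bits of the key

print(directory_index(0b1100, 1))  # 1 -> directory entry "1"
print(directory_index(0b1100, 2))  # 3 -> directory entry "11"
```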

26 Extendible Hashing: Search(1100) global depth d= H 0 1 directory

27 Extendible Hashing: Search(1100) global depth d= H 0 1 directory

28 Extendible Hashing: Insert Retrieve the bucket where the value should be stored; if there is enough room, store it! Otherwise, get the bucket's local depth d′. If d′ < d: the block is halved (split into two); the keys are redistributed between the two halves using d′+1 bits; d′ is incremented to d′+1; the directory is updated with the pointer to the newly created block. If d′ = d: d is incremented to d+1 (doubling); each directory entry is doubled, so that each entry w produces two entries, w0 and w1; then continue as in the previous case (see the sketch below). 28
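The insertion procedure above can be sketched in Python as follows. This is a simplified in-memory model, not a disk-based implementation: the 4-bit key width and the bucket capacity of 2 records are assumptions made only for the demo.

```python
KEY_BITS = 4          # keys in the slides are 4-bit strings (assumption)
BUCKET_CAPACITY = 2   # m records per block (assumption)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # d'
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1                      # d
        self.directory = [Bucket(1), Bucket(1)]    # 2**d entries

    def _index(self, key):
        # directory entry = d most significant bits of the key
        return key >> (KEY_BITS - self.global_depth)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # doubling: every entry w produces two entries w0 and w1
            self.global_depth += 1
            self.directory = [b for b in self.directory for _ in (0, 1)]
        # halving (split): redistribute the keys on d'+1 bits
        bucket.local_depth += 1
        new = Bucket(bucket.local_depth)
        pending, bucket.keys = bucket.keys + [key], []
        # repoint the directory entries whose (d'+1)-th most significant bit is 1
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (self.global_depth - bucket.local_depth)) & 1:
                self.directory[i] = new
        for k in pending:          # reinsert (may split again if still skewed)
            self.insert(k)

eh = ExtendibleHash()
for k in (0b1100, 0b0111, 0b1010, 0b0100, 0b1000):
    eh.insert(k)
print(eh.global_depth, eh.search(0b1100))   # global depth grows as needed; True
```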

29 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored d= H

30 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored d= H There is no room! 30

31 Extendible Hashing: Insert() (2) Compare d′ and d. d=1 H d′ = d 31

32 Extendible Hashing: Insert() (3) Increase d by 1. d= H

33 Extendible Hashing: Insert() (3) Each directory entry is doubled, so that each entry w produces two entries, w0 and w1. d= H

34 Extendible Hashing: Insert() (4) The block is halved d=

35 Extendible Hashing: Insert() (4) The block is halved d= SPLIT 1 35

36 Extendible Hashing: Insert() (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d= SPLIT

37 Extendible Hashing: Insert() (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. d= SPLIT

38 Extendible Hashing: Insert() (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. The directory is updated with the pointer to the newly created block. d= SPLIT

39 Extendible Hashing: Insert(0100) Retrieve the bucket where the value should be stored 0100 H d=

40 Extendible Hashing: Insert(0100) Retrieve the bucket where the value should be stored 0100 H d=

41 Extendible Hashing: Insert(0100) Retrieve the bucket where the value should be stored 0100 H d= There is enough room, insert it! 41

42 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored H d=

43 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored H d= There is no room! 43

44 Extendible Hashing: Insert() (2) Compare d′ and d. H d= d′ < d 44

45 Extendible Hashing: Insert() (3) The block is halved d=

46 Extendible Hashing: Insert() (3) The block is halved d= SPLIT

47 Extendible Hashing: Insert() (3) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d= SPLIT 47

48 Extendible Hashing: Insert() (3) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. d= SPLIT 48

49 Extendible Hashing: Insert() (3) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. The directory is updated with the pointer to the newly created block. d= SPLIT 49

50 Extendible Hashing: Insert(1000) (1) Retrieve the bucket where the value should be stored H d=

51 Extendible Hashing: Insert(1000) (1) Retrieve the bucket where the value should be stored H d= There is no room! 51

52 Extendible Hashing: Insert(1000) (2) Compare d′ and d H d= d′ = d 52

53 Extendible Hashing: Insert(1000) (3) Increase d by 1 H d=

54 Extendible Hashing: Insert(1000) (4) Increase d by 1. Each directory entry is doubled, so that each entry w produces two entries, w0 and w1. H d=3

55 Extendible Hashing: Insert(1000) (4) The block is halved d=

56 Extendible Hashing: Insert(1000) (4) The block is halved d= SPLIT 56

57 Extendible Hashing: Insert(1000) (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits d= SPLIT 57

58 Extendible Hashing: Insert(1000) (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one d= SPLIT 58

59 Extendible Hashing: Insert(1000) (5) The directory is updated with the pointer to the newly created block d=

60 Extendible Hashing: Insert(1000) (5) The directory is updated with the pointer to the newly created block d=

61 How many blocks are there? A 3 B 5 C 6 D 8 d=

62 How many buckets are there? A 3 B 5 C 6 D 8 d=

63 Extendible hashing: discussion Pros: the space overhead for the directory is negligible; splitting causes minor reorganizations, since only the records in one bucket (those starting with the same bit sequence) are redistributed to the two new buckets. Cons: the directory must be searched before accessing the buckets themselves, so we need two block accesses (directory + data file buckets); this penalty is considered minor. The exponential growth of the directory in main memory is quite inefficient (it may allocate more space than actually required). 63

64 Linear Hashing The hash file expands and shrinks without needing a directory. Overflow blocks are allowed. The number of buckets increases linearly. Such an increase happens (i) when an overflow block is inserted, or (ii) when a specific record-to-bucket ratio (the file load factor r/n) exceeds a guard value l (e.g. l = 1.7). H_i(K) returns the i least significant bits of K. 64

65 Linear Hashing: Search() Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=2 n=3 r=4 r/n = 1.33

66 Linear Hashing: Search() Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=2 n=3 r=4 r/n = 1.33 H_2() = 01_2 = 1 < 3 66

67 Linear Hashing: Search() Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=2 n=3 r=4 r/n = 1.33 H_2() = 01_2 = 1 < 3 67

68 Linear Hashing: Search(1111) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 1111 H i=2 n=3 r=4 r/n = 1.33

69 Linear Hashing: Search(1111) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H_2(1111) = 11_2 = 3 ≥ 3, so the bucket is 3 − 2^(2-1) = 1 (01_2). i=2 n=3 r=4 r/n = 1.33 69

70 Linear Hashing: Search(1111) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H_2(1111) = 11_2 = 3 ≥ 3, so the bucket is 3 − 2^(2-1) = 1 (01_2). i=2 n=3 r=4 r/n = 1.33 70
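A minimal sketch (not from the slides) of the bucket-address rule used in the search examples above; the key width and example values follow the slides:

```python
# Linear-hashing address rule: h_i(K) is the i least significant bits of K;
# if the resulting bucket m has not been created yet (m >= n), fall back to
# bucket m - 2**(i-1).
def bucket_address(key, i, n):
    m = key & ((1 << i) - 1)      # h_i(K): least significant i bits
    return m if m < n else m - 2 ** (i - 1)

# With i=2 and n=3 buckets (the slides' example): 1111 hashes to 11 (3), which
# does not exist yet, so the record is stored in bucket 3 - 2 = 1 (binary 01).
print(bucket_address(0b1111, i=2, n=3))  # 1
```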

71 Linear Hashing: Insert (1) r records, n buckets, file load factor r/n, guard value l. When the file load factor is exceeded, a bucket gets split and a new (n+1)-th bucket is created. While the hash function H_i is in use, the buckets up to the 2^(i-1)-th are split one after the other. When n > 2^i, H_i is replaced by H_(i+1) and the buckets are split, once again, starting from the first one. 71

72 Linear Hashing: Insert (2) If H_i(k) = m < n: store k in the m-th bucket (use overflows when necessary); otherwise store k in the (m − 2^(i-1))-th bucket. Increase r; if r/n > l then: if n = 2^i, increase i by 1; write (n)_2 = a_1 a_2 ... a_i with a_1 = 1, clear the first bit of n and store the result in m (a_1 a_2 ... a_i → 0 a_2 ... a_i); allocate the n-th block; move all the records of block m whose i-th rightmost bit is 1 into the n-th block; increase n (see the sketch below). 72
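The insertion procedure can be sketched as follows. Bucket capacity is not modelled explicitly (each bucket is a flat list, so records beyond the block size simply play the role of overflow records), and the guard value l = 1.7 follows the slides; everything else is an assumption for the demo.

```python
LOAD_GUARD = 1.7   # guard value l (from the slides)

class LinearHash:
    def __init__(self):
        self.i = 1
        self.n = 2                       # number of buckets
        self.r = 0                       # number of records
        self.buckets = [[], []]          # main block + overflow records together

    def _address(self, key):
        m = key & ((1 << self.i) - 1)    # h_i(K): least significant i bits
        return m if m < self.n else m - 2 ** (self.i - 1)

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.r += 1
        if self.r / self.n > LOAD_GUARD:
            self._grow()

    def _grow(self):
        if self.n == 2 ** self.i:        # all buckets of this round exist
            self.i += 1
        m = self.n - 2 ** (self.i - 1)   # bucket to split: clear the top bit of n
        self.buckets.append([])          # allocate bucket n
        bit = self.i - 1                 # i-th rightmost bit
        moved = [k for k in self.buckets[m] if (k >> bit) & 1]
        self.buckets[m] = [k for k in self.buckets[m] if not (k >> bit) & 1]
        self.buckets[self.n] = moved
        self.n += 1

lh = LinearHash()
for k in (0b0101, 0b1111, 0b0100, 0b0001, 0b0110):
    lh.insert(k)
print(lh.n, lh.buckets)   # the number of buckets grows linearly as r/n exceeds l
```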

73 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=1 n=2 r=3 r/n = 1.5

74 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=1 n=2 r=3 r/n = 1.5 H_1() = 1_2 = 1 < 2 74

75 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=1 n=2 r=3 r/n = 1.5 H_1() = 1_2 = 1 < 2 75

76 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. H i=1 n=2 r=4 r/n = 2 > 1.7 H_1() = 1_2 = 1 < 2 76

77 Linear Hashing: Insert() (2) i=1 n=2 r=4 r/n = 2 77

78 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. i=2 n=2 r=4 r/n = 2 78

79 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. i=2 n=2 r=4 r/n = 2

80 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. Move all the records of block m whose second rightmost bit is 1 into the n-th block. i=2 n=2 r=4 r/n = 2 80

81 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=3 r=4 r/n = 2 81

82 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=3 r=4 r/n = 1.33 < 1.7 82

83 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33

84 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 84

85 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 85

86 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 86

87 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. 0001 H i=2 n=3 r=5 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 87

88 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. 0001 H i=2 n=3 r=5 r/n = 1.67 < 1.7 H_2(0001) = 01_2 = 1 < 3 88

89 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. i=2 n=3 r=5 r/n = 1.67 H 0110 89

90 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. i=2 n=3 r=5 r/n = 1.67 H 0110 H_2(0110) = 10_2 = 2 < 3 90

91 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. i=2 n=3 r=5 r/n = 1.67 H 0110 H_2(0110) = 10_2 = 2 < 3 91

92 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. 0110 H i=2 n=3 r=6 r/n = 2 > 1.7 H_2(0110) = 10_2 = 2 < 3 92

93 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. i=2 n=3 r=6 r/n = 2 93

94 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. i=2 n=3 r=6 r/n = 2 94

95 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. i=2 n=3 r=6 r/n = 2 95

96 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. i=2 n=3 r=6 r/n = 2 96

97 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=4 r=6 r/n = 2 97

98 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=4 r=6 r/n = 1.5 < 1.7 98

99 How many blocks are there? A 2 B 3 C 4 D i=2 n=3 r=

100 How many buckets are there? A 2 B 3 C 4 D i=2 n=3 r=

101 Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). These days we frequently think first of web search, but there are many other cases: search, searching your laptop, corporate knowledge bases, domain-specific search (e.g. legal information retrieval), graph databases.

102 Basic Assumptions Document: an unstructured file. Collection: a set of documents; assume it is a static, non-hypertext collection for the moment. Query: a set (or sequence) of keywords expressing an information need. Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.

103 Inverted indexes Inverted indexes are used to efficiently retrieve unstructured documents through full-text queries. Such indexes are inverted because they index, for each word, all the documents where that word appears. A document collection is D = {d_1, ..., d_n}. A vocabulary V is the set of distinct terms in the document set. An inverted index IX of a document collection attaches to each distinct term the list of all documents that contain the term (with the positions of its occurrences): for each v in V, for each d_i in D such that v ∈ d_i, IX[v] = { (D_i, j) | d_i[j] = v }. A Python sketch follows below. 103
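A minimal Python sketch of the construction above. The collection is modelled as a dict from document id to text; punctuation, accents and apostrophes are dropped and tokenization is a plain whitespace split (all assumptions made for the demo, so that the positions match the slides' example).

```python
# Positional inverted index: for every term, the index stores the
# (document id, word position) pairs where the term occurs.
from collections import defaultdict

def build_inverted_index(collection):
    ix = defaultdict(list)
    for doc_id, text in collection.items():
        for position, term in enumerate(text.lower().split(), start=1):
            ix[term].append((doc_id, position))
    return ix

D = {
    "D1": "una mattina svegliandosi da sogni inquieti gregor samsa si trovo "
          "nel suo letto trasformato in un insetto mostruoso",
    "D3": "vidi un magnifico disegno rappresentava un serpente boa nell atto "
          "di inghiottire un animale",
}
ix = build_inverted_index(D)
print(ix["un"])   # [('D1', 16), ('D3', 2), ('D3', 6), ('D3', 13)], as on slide 105
```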

104 Generating the Inverted Index (1) D1: "Una mattina, svegliandosi da sogni inquieti, Gregor Samsa si trovò nel suo letto trasformato in un insetto mostruoso" (word positions: da 4, Gregor 7, in 15, un 16). D2: "Voi che trovate tornando a casa il cibo caldo e visi amici Considerate se questo è un uomo" (word positions: a 5, casa 6, che 2). D3: "vidi un magnifico disegno. Rappresentava un serpente boa nell'atto di inghiottire un animale" (word positions: animale 14, atto 10, boa 8, un 2, 6, 13). 104

105 Generating the Inverted Index (1) a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13) 105

106 Inverted Index: queries (1) Return documents containing un AND atto. a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13). atto appears only in D3; un appears in D1 and D3; the intersection is D3. D3: vidi un magnifico disegno. Rappresentava un serpente boa nell'atto di inghiottire un animale 106
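Using the index built in the earlier sketch, a boolean AND query can be evaluated by intersecting the sets of document ids in the postings of each term (a minimal sketch, not the slides' actual implementation):

```python
# Boolean AND: intersect the document-id sets of every query term's postings.
def and_query(ix, terms):
    doc_sets = [{doc for doc, _ in ix.get(t, [])} for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(and_query(ix, ["un", "atto"]))   # {'D3'}, matching the slide's result
```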

107 Stemming Reduce terms to their roots before indexing; this also reduces the size of the inverted index. Stemming usually means crude, language-dependent suffix chopping, e.g. automate(s), automatic, automation are all reduced to automat. For example, "compressed" and "compression" are both accepted as equivalent to "compress"; after stemming, that sentence reads: "for exampl compress and compress ar both accept as equival to compress".
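A toy suffix-chopping stemmer in the spirit of the slide (deliberately crude, not Porter's algorithm; the suffix list is an assumption):

```python
# Crude suffix chopping: strip a few common English suffixes before indexing.
def crude_stem(term):
    for suffix in ("ation", "ions", "ion", "ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

print(crude_stem("compressed"), crude_stem("compression"))  # compress compress
```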

108 Stop words With a stop list, you exclude the commonest words from the dictionary entirely. Intuition: they have little semantic content (the, a, and, to, be) and they are very frequent (~30% of postings for the top 30 words). Dropping them significantly reduces the size of an inverted index and allows query optimizations, because the machine representation of a query won't include such words. But the trend is away from doing this: you need stop words for sentence/phrase queries ("King of Denmark"), various song titles ("Let it be", "To be or not to be"), and relational queries ("flights to London").
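A minimal sketch of stop-word removal at indexing time; the stop list here is a tiny illustrative assumption:

```python
# Drop the commonest terms before they reach the inverted index.
STOP_WORDS = {"the", "a", "and", "to", "be", "of"}

def remove_stop_words(terms):
    return [t for t in terms if t not in STOP_WORDS]

print(remove_stop_words("flights to london".split()))  # ['flights', 'london']
```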

109 Inverted Index: queries (2) Return documents containing the phrase un atto. a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13). Relevant postings: atto (D3,10); un (D1,16), (D3,2), (D3,6), (D3,13). Candidate position pairs in D3: (2,10), (6,10), (13,10): no occurrence of un is immediately followed by atto. 109

110 Inverted Index: queries (3) Return documents containing the phrase un animale. a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13). Relevant postings: animale (D3,14); un (D1,16), (D3,2), (D3,6), (D3,13). Candidate position pairs in D3: (2,14), (6,14), (13,14); positions 13 and 14 are adjacent, so D3 matches: vidi un magnifico disegno. Rappresentava un serpente boa nell'atto di inghiottire un animale 110
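A minimal sketch of the positional phrase queries evaluated in the last two slides, reusing the index built earlier: a document matches the phrase "t1 t2" when some occurrence of t2 sits at the position immediately after an occurrence of t1.

```python
# Phrase query on a positional inverted index: t1 must be directly followed by t2.
def phrase_query(ix, t1, t2):
    pos1 = {}
    for doc, p in ix.get(t1, []):
        pos1.setdefault(doc, set()).add(p)
    return {doc for doc, p in ix.get(t2, []) if p - 1 in pos1.get(doc, set())}

print(phrase_query(ix, "un", "atto"))     # set(): no "un" right before "atto"
print(phrase_query(ix, "un", "animale"))  # {'D3'}: positions 13 and 14 are adjacent
```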
