Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

1 Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

2 References Hashing Techniques: Elmasri, 7th Ed. Chapter 16, section 8. Cormen, 3rd Ed. Chapter 11. Inverted indexing: Elmasri, 7th Ed. Chapter 27, section 5. 2

3 Hashing Techniques (1) Highly efficient indexes for equality search. They can be used for both internal and external files. We can classify the techniques into two types: Static Hashing: for fixed-size, non-mutable data (e.g. single-session CD-ROM/DVD/Blu-ray). Hashing for Dynamic File Expansion: when both the data and the data size can vary over time. As with tree indexes for secondary memory, hash indexes use blocks to store buckets. 3

4 Hashing Techniques (2) Information can be accessed fast: hashing is another type of primary file organization. Uses: value (equality) search, bucketing (see hash join). We want to examine an arbitrary position in an array in O(1) time; for some hash data structures the worst-case seek time is Θ(n). Hashing allows implementing the dictionary operations (Insert, Search, Delete). 4

5 Hashing Functions We want to store our records in a given number of blocks (and buckets). The search condition involves a (single) hash field; in most cases this field is a key field of the file, in which case it is called the hash key. We define a hash function mapping each value into the range 0..M-1, e.g. h(k) = k mod M. If k is non-numeric, its byte representation is used instead. 5
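As a minimal illustration of the slide above (not part of the original material), a possible hash function in Python, assuming M buckets and falling back to the key's byte representation for non-numeric keys:

```python
# Minimal sketch of a hash function for a hash file with M buckets.
# Numeric keys use k mod M; non-numeric keys hash their byte representation.
def h(key, M):
    if isinstance(key, int):
        return key % M
    # Fall back to the key's byte representation (assumption: UTF-8 encoding).
    return sum(str(key).encode("utf-8")) % M

print(h(1100, 4))      # 0
print(h("Gregor", 4))  # some bucket index in range 0..M-1
```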

6 Collision Resolution Some collisions may happen (there exist k ≠ k′ such that h(k) = h(k′)) and they have to be resolved (collision resolution). The main collision resolution strategies are: Chaining: the bucket array is extended with overflow positions. Open addressing: starting from the occupied position, we seek the next free position. Multiple hashing: if one hash function generates a collision, we try the next function. We are going to see only the first technique. 6
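A minimal Python sketch of chaining, the only technique covered here: each bucket holds a chain of records and colliding keys are simply appended to the same chain. The table size M and the record format are illustrative assumptions.

```python
# Collision resolution by chaining: each of the M buckets holds a list of
# (key, value) records; keys that hash to the same bucket share the chain.
class ChainedHashTable:
    def __init__(self, M=4):
        self.M = M
        self.buckets = [[] for _ in range(M)]

    def _h(self, k):
        return k % self.M

    def insert(self, k, value):
        self.buckets[self._h(k)].append((k, value))

    def search(self, k):
        for key, value in self.buckets[self._h(k)]:
            if key == k:
                return value
        return None

    def delete(self, k):
        bucket = self.buckets[self._h(k)]
        self.buckets[self._h(k)] = [(key, v) for key, v in bucket if key != k]

t = ChainedHashTable(M=4)
t.insert(0b1100, "r1")   # 12 mod 4 = 0
t.insert(0b0100, "r2")   # 4 mod 4 = 0 -> collides, chained in the same bucket
print(t.search(0b1100))  # r1
```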

7 Static Hashing A fixed number of buckets M is allocated, together with a hash function mapping K to [0, ..., M-1]. Each bucket has at least one block; each block is composed of m records. E.g., H(K) returns the first/last i bits of K's byte representation; in this example i = 1, so there are 2^i = 2 buckets, and each bucket contains one block. [Figure: record blocks and buckets] 7

8 Static Hashing for Internal Files 8

9 Static Hashing for Internal Files 9

10 Static Hashing: searching 1100 The hash function computes the array index where the records with a given hash field are stored 1100 H Hash Function 10

11 Static Hashing: Search(1100) The hash function computes the array index where the records with a given hash field are stored 1100 H Hash Function 11

12 Static Hashing: Insert() We use the chaining method to extend a block with overflow positions. H

13 Static Hashing: Insert() We use the chaining method to extend a block with overflow positions. H

14 Static Hashing: Insert() We use the chaining method to extend a block with overflow positions. H

15 How many blocks are there? A 0 B 1 4 C 2 D

16 How many buckets are there? A 0 B 1 4 C 2 D

17 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows. H

18 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows. H

19 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows. H

20 Static Hashing: Delete(1100) Overflows degrade static hashing's efficiency. Value deletion can reduce the number of overflows.

21 Static Hashing: Efficiency(?) When no overflow occurs, a bucket can be reached with a single access; then at most m records have to be scanned to find a specific record. Overflows cause a quick performance degradation. Efficiency depends on: the index size vs. data size ratio (number of buckets), and uniform hashing (e.g. H(k) must not be a constant function). H(K) can vary at run time in order to reduce the number of overflows. 21

22 Dynamic File Expansion Collision overflows can be avoided by changing the number of buckets dynamically. Extendible Hashing: buckets are accessed through an extendible directory storing pointers to buckets; overflows are not used; the number of indexed buckets grows exponentially. Linear Hashing: no directory is used; overflows are still used; the number of indexed buckets grows linearly. Dynamic Hashing: a precursor of extendible hashing; this topic will not be covered here. 22

23 External Hashing: a general example 23

24 Extendible Hashing (1) Uses a level of indirection through an array of pointers to buckets (the directory). The directory grows by doubling its size: it has size 2^d, where d (the global depth) grows by one at each doubling. Each entry in the directory points to a single bucket, but each bucket can be reached from multiple entries. Each bucket stores a variable d′ (local depth) that indicates how many bits are actually used to index it. 24

25 Extendible Hashing (2) The hash function changes together with d: it returns the d most significant bits of the binary-coded search key. Extendible hashing doesn't use overflow blocks, as the data structure is dynamically updated. 25
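A small sketch (not from the slides) of how the directory index can be derived from the d most significant bits, assuming fixed-width 4-bit keys as in the figures:

```python
# Sketch: with global depth d, the directory index is the d most significant
# bits of the fixed-width binary key. KEY_BITS = 4 is an assumption for the demo.
KEY_BITS = 4

def directory_index(key, d):
    return key >> (KEY_BITS - d)   # top d bits of the key

print(directory_index(0b1100, 1))  # 1 -> directory entry "1"
print(directory_index(0b1100, 2))  # 3 -> directory entry "11"
```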

26 Extendible Hashing: Search(1100) global depth d= H 0 1 directory

27 Extendible Hashing: Search(1100) global depth d= H 0 1 directory

28 Extendible Hashing: Insert Retrieve the bucket where the value should be stored; if there is enough room, store it! Otherwise, get the bucket's local depth d′. If d′ < d: the block is halved (split into two); the keys are redistributed between the two halves using d′+1 bits; d′ is incremented to d′+1; the directory is updated with the pointer to the newly created block. If d′ = d: d is incremented to d+1 (doubling); each directory entry is doubled, so that each entry w produces two entries, w0 and w1; then continue as in the previous case (see the sketch below). 28
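The insertion procedure above can be sketched in Python as follows. This is a simplified in-memory model, not a disk-based implementation: the 4-bit key width and the bucket capacity of 2 records are assumptions made only for the demo.

```python
KEY_BITS = 4          # keys in the slides are 4-bit strings (assumption)
BUCKET_CAPACITY = 2   # m records per block (assumption)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # d'
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1                      # d
        self.directory = [Bucket(1), Bucket(1)]    # 2**d entries

    def _index(self, key):
        # directory entry = d most significant bits of the key
        return key >> (KEY_BITS - self.global_depth)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # doubling: every entry w produces two entries w0 and w1
            self.global_depth += 1
            self.directory = [b for b in self.directory for _ in (0, 1)]
        # halving (split): redistribute the keys on d'+1 bits
        bucket.local_depth += 1
        new = Bucket(bucket.local_depth)
        pending, bucket.keys = bucket.keys + [key], []
        # repoint the directory entries whose (d'+1)-th most significant bit is 1
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (self.global_depth - bucket.local_depth)) & 1:
                self.directory[i] = new
        for k in pending:          # reinsert (may split again if still skewed)
            self.insert(k)

eh = ExtendibleHash()
for k in (0b1100, 0b0111, 0b1010, 0b0100, 0b1000):
    eh.insert(k)
print(eh.global_depth, eh.search(0b1100))   # global depth grows as needed; True
```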

29 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored d= H

30 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored d= H There is no room! 30

31 Extendible Hashing: Insert() (2) Compare d′ and d. d=1 H d′ = d 31

32 Extendible Hashing: Insert() (3) Increase d by 1. d= H

33 Extendible Hashing: Insert() (3) Each directory entry is doubled, so that each entry w produces two entries, w0 and w1. d= H

34 Extendible Hashing: Insert() (4) The block is halved d=

35 Extendible Hashing: Insert() (4) The block is halved d= SPLIT 1 35

36 Extendible Hashing: Insert() (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d= SPLIT

37 Extendible Hashing: Insert() (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. d= SPLIT

38 Extendible Hashing: Insert() (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. The directory is updated with the pointer to the newly created block. d= SPLIT

39 Extendible Hashing: Insert(0100) Retrieve the bucket where the value should be stored 0100 H d=

40 Extendible Hashing: Insert(0100) Retrieve the bucket where the value should be stored 0100 H d=

41 Extendible Hashing: Insert(0100) Retrieve the bucket where the value should be stored 0100 H d= There is enough room, insert it! 41

42 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored H d=

43 Extendible Hashing: Insert() (1) Retrieve the bucket where the value should be stored H d= There is no room! 43

44 Extendible Hashing: Insert() (2) Compare d′ and d. H d= d′ < d 44

45 Extendible Hashing: Insert() (3) The block is halved d=

46 Extendible Hashing: Insert() (3) The block is halved d= SPLIT

47 Extendible Hashing: Insert() (3) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d= SPLIT 47

48 Extendible Hashing: Insert() (3) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. d= SPLIT 48

49 Extendible Hashing: Insert() (3) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one. The directory is updated with the pointer to the newly created block. d= SPLIT 49

50 Extendible Hashing: Insert(1000) (1) Retrieve the bucket where the value should be stored H d=

51 Extendible Hashing: Insert(1000) (1) Retrieve the bucket where the value should be stored H d= There is no room! 51

52 Extendible Hashing: Insert(1000) (2) Compare d′ and d H d= d′ = d 52

53 Extendible Hashing: Insert(1000) (3) Increase d by 1 H d=

54 Extendible Hashing: Insert(1000) (4) Increase d by 1. Each directory entry is doubled, so that each entry w produces two entries, w0 and w1. H d=3

55 Extendible Hashing: Insert(1000) (4) The block is halved d=

56 Extendible Hashing: Insert(1000) (4) The block is halved d= SPLIT 56

57 Extendible Hashing: Insert(1000) (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits d= SPLIT 57

58 Extendible Hashing: Insert(1000) (4) The block is halved. Keys are distributed between the halved blocks using d′+1 bits. d′ is increased by one d= SPLIT 58

59 Extendible Hashing: Insert(1000) (5) The directory is updated with the pointer to the newly created block d=

60 Extendible Hashing: Insert(1000) (5) The directory is updated with the pointer to the newly created block d=

61 How many blocks are there? A 3 B 5 C 6 D 8 d=

62 How many buckets are there? A 3 B 5 C 6 D 8 d=

63 Extendible hashing: discussion Pros: the space overhead for the directory is negligible; splitting causes minor reorganizations, since only the records in one bucket (those starting with the same bit sequence) are redistributed to the two new buckets. Cons: the directory must be searched before accessing the buckets themselves, so we need two block accesses (directory + data file buckets); this penalty is considered minor. The exponential growth of the directory in main memory is quite inefficient (it may allocate more space than actually required). 63

64 Linear Hashing The hash file expands and shrinks without needing a directory. Overflow blocks are allowed. The number of buckets increases linearly. Such an increase happens (i) when an overflow block is inserted, or (ii) when a specific record-to-bucket ratio (the file load factor r/n) exceeds a guard value l (e.g. l = 1.7). H_i(K) returns the i least significant bits of K. 64

65 Linear Hashing: Search() Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=2 n=3 r=4 r/n = 1.33

66 Linear Hashing: Search() Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=2 n=3 r=4 r/n = 1.33 H_2() = 01_2 = 1 < 3 66

67 Linear Hashing: Search() Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=2 n=3 r=4 r/n = 1.33 H_2() = 01_2 = 1 < 3 67

68 Linear Hashing: Search(1111) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 1111 H i=2 n=3 r=4 r/n = 1.33

69 Linear Hashing: Search(1111) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H_2(1111) = 11_2 = 3 ≥ 3, so the bucket is 3 − 2^(2-1) = 1 (01_2). i=2 n=3 r=4 r/n = 1.33 69

70 Linear Hashing: Search(1111) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H_2(1111) = 11_2 = 3 ≥ 3, so the bucket is 3 − 2^(2-1) = 1 (01_2). i=2 n=3 r=4 r/n = 1.33 70
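A minimal sketch (not from the slides) of the bucket-address rule used in the search examples above; the key width and example values follow the slides:

```python
# Linear-hashing address rule: h_i(K) is the i least significant bits of K;
# if the resulting bucket m has not been created yet (m >= n), fall back to
# bucket m - 2**(i-1).
def bucket_address(key, i, n):
    m = key & ((1 << i) - 1)      # h_i(K): least significant i bits
    return m if m < n else m - 2 ** (i - 1)

# With i=2 and n=3 buckets (the slides' example): 1111 hashes to 11 (3), which
# does not exist yet, so the record is stored in bucket 3 - 2 = 1 (binary 01).
print(bucket_address(0b1111, i=2, n=3))  # 1
```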

71 Linear Hashing: Insert (1) r records, n buckets, file load factor r/n, guard value l. When the file load factor is exceeded, a bucket gets split and a new (n+1)-th bucket is created. While the hash function H_i is in use, the buckets up to the 2^(i-1)-th are split one after the other. When n > 2^i, H_i is replaced by H_(i+1) and the buckets are split, once again, starting from the first one. 71

72 Linear Hashing: Insert (2) If H_i(k) = m < n: store k in the m-th bucket (use overflows when necessary); otherwise store k in the (m − 2^(i-1))-th bucket. Increase r; if r/n > l then: if n = 2^i, increase i by 1; write (n)_2 = a_1 a_2 ... a_i with a_1 = 1, clear the first bit of n and store the result in m (a_1 a_2 ... a_i → 0 a_2 ... a_i); allocate the n-th block; move all the records of block m whose i-th rightmost bit is 1 into the n-th block; increase n (see the sketch below). 72
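The insertion procedure can be sketched as follows. Bucket capacity is not modelled explicitly (each bucket is a flat list, so records beyond the block size simply play the role of overflow records), and the guard value l = 1.7 follows the slides; everything else is an assumption for the demo.

```python
LOAD_GUARD = 1.7   # guard value l (from the slides)

class LinearHash:
    def __init__(self):
        self.i = 1
        self.n = 2                       # number of buckets
        self.r = 0                       # number of records
        self.buckets = [[], []]          # main block + overflow records together

    def _address(self, key):
        m = key & ((1 << self.i) - 1)    # h_i(K): least significant i bits
        return m if m < self.n else m - 2 ** (self.i - 1)

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.r += 1
        if self.r / self.n > LOAD_GUARD:
            self._grow()

    def _grow(self):
        if self.n == 2 ** self.i:        # all buckets of this round exist
            self.i += 1
        m = self.n - 2 ** (self.i - 1)   # bucket to split: clear the top bit of n
        self.buckets.append([])          # allocate bucket n
        bit = self.i - 1                 # i-th rightmost bit
        moved = [k for k in self.buckets[m] if (k >> bit) & 1]
        self.buckets[m] = [k for k in self.buckets[m] if not (k >> bit) & 1]
        self.buckets[self.n] = moved
        self.n += 1

lh = LinearHash()
for k in (0b0101, 0b1111, 0b0100, 0b0001, 0b0110):
    lh.insert(k)
print(lh.n, lh.buckets)   # the number of buckets grows linearly as r/n exceeds l
```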

73 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=1 n=2 r=3 r/n = 1.5

74 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=1 n=2 r=3 r/n = 1.5 H_1() = 1_2 = 1 < 2 74

75 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. H i=1 n=2 r=3 r/n = 1.5 H_1() = 1_2 = 1 < 2 75

76 Linear Hashing: Insert() (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. H i=1 n=2 r=4 r/n = 2 > 1.7 H_1() = 1_2 = 1 < 2 76

77 Linear Hashing: Insert() (2) i=1 n=2 r=4 r/n = 2 77

78 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. i=2 n=2 r=4 r/n = 2 78

79 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. i=2 n=2 r=4 r/n = 2

80 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. Move all the records of block m whose second rightmost bit is 1 into the n-th block. i=2 n=2 r=4 r/n = 2 80

81 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=3 r=4 r/n = 2 81

82 Linear Hashing: Insert() (2) n = 2^1, so increase i by 1 (i=2). (n)_2 = 10, (m)_2 = 00. Allocate the n-th block, having (n)_2 = 10. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=3 r=4 r/n = 1.33 < 1.7 82

83 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33

84 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 84

85 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 85

86 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. 0001 H i=2 n=3 r=4 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 86

87 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. 0001 H i=2 n=3 r=5 r/n = 1.33 H_2(0001) = 01_2 = 1 < 3 87

88 Linear Hashing: Insert(0001) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. 0001 H i=2 n=3 r=5 r/n = 1.67 < 1.7 H_2(0001) = 01_2 = 1 < 3 88

89 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. i=2 n=3 r=5 r/n = 1.67 H 0110 89

90 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. i=2 n=3 r=5 r/n = 1.67 H 0110 H_2(0110) = 10_2 = 2 < 3 90

91 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. i=2 n=3 r=5 r/n = 1.67 H 0110 H_2(0110) = 10_2 = 2 < 3 91

92 Linear Hashing: Insert(0110) (1) Given n buckets (2^(i-1) < n ≤ 2^i): if H_i(K) = m < n, K is in the m-th bucket; if H_i(K) = m ≥ n, K is in the (m − 2^(i-1))-th bucket. Increase r. 0110 H i=2 n=3 r=6 r/n = 2 > 1.7 H_2(0110) = 10_2 = 2 < 3 92

93 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. i=2 n=3 r=6 r/n = 2 93

94 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. i=2 n=3 r=6 r/n = 2 94

95 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. i=2 n=3 r=6 r/n = 2 95

96 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. i=2 n=3 r=6 r/n = 2 96

97 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=4 r=6 r/n = 2 97

98 Linear Hashing: Insert(0110) (2) n ≠ 2^2, so i is not increased. (n)_2 = 11, (m)_2 = 01. Allocate the n-th block, having (n)_2 = 11. Move all the records of block m whose second rightmost bit is 1 into the n-th block. Increase n. i=2 n=4 r=6 r/n = 1.5 < 1.7 98

99 How many blocks are there? A 2 B 3 C 4 D i=2 n=3 r=

100 How many buckets are there? A 2 B 3 C 4 D i=2 n=3 r=

101 Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). These days we frequently think first of web search, but there are many other cases: search, searching your laptop, corporate knowledge bases, domain-specific search (e.g. legal information retrieval), graph databases.

102 Basic Assumptions Document: an unstructured file. Collection: a set of documents; assume it is a static, non-hypertext collection for the moment. Query: a set (or sequence) of keywords expressing an information need. Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.

103 Inverted indexes Inverted indexes are used to efficiently retrieve unstructured documents through full-text queries. Such indexes are inverted because they index, for each word, all the documents where that word appears. A document collection is D = {d_1, ..., d_n}. A vocabulary V is the set of distinct terms in the document set. An inverted index IX of a document collection attaches to each distinct term the list of all documents that contain the term (with the positions of its occurrences): for each v in V, for each d_i in D such that v ∈ d_i, IX[v] = { (D_i, j) | d_i[j] = v }. A Python sketch follows below. 103
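A minimal Python sketch of the construction above. The collection is modelled as a dict from document id to text; punctuation, accents and apostrophes are dropped and tokenization is a plain whitespace split (all assumptions made for the demo, so that the positions match the slides' example).

```python
# Positional inverted index: for every term, the index stores the
# (document id, word position) pairs where the term occurs.
from collections import defaultdict

def build_inverted_index(collection):
    ix = defaultdict(list)
    for doc_id, text in collection.items():
        for position, term in enumerate(text.lower().split(), start=1):
            ix[term].append((doc_id, position))
    return ix

D = {
    "D1": "una mattina svegliandosi da sogni inquieti gregor samsa si trovo "
          "nel suo letto trasformato in un insetto mostruoso",
    "D3": "vidi un magnifico disegno rappresentava un serpente boa nell atto "
          "di inghiottire un animale",
}
ix = build_inverted_index(D)
print(ix["un"])   # [('D1', 16), ('D3', 2), ('D3', 6), ('D3', 13)], as on slide 105
```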

104 Generating the Inverted Index (1) D1: "Una mattina, svegliandosi da sogni inquieti, Gregor Samsa si trovò nel suo letto trasformato in un insetto mostruoso" (word positions: da 4, Gregor 7, in 15, un 16). D2: "Voi che trovate tornando a casa il cibo caldo e visi amici Considerate se questo è un uomo" (word positions: a 5, casa 6, che 2). D3: "vidi un magnifico disegno. Rappresentava un serpente boa nell'atto di inghiottire un animale" (word positions: animale 14, atto 10, boa 8, un 2, 6, 13). 104

105 Generating the Inverted Index (1) a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13) 105

106 Inverted Index: queries (1) Return documents containing un AND atto. a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13). atto appears only in D3; un appears in D1 and D3; the intersection is D3. D3: vidi un magnifico disegno. Rappresentava un serpente boa nell'atto di inghiottire un animale 106
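Using the index built in the earlier sketch, a boolean AND query can be evaluated by intersecting the sets of document ids in the postings of each term (a minimal sketch, not the slides' actual implementation):

```python
# Boolean AND: intersect the document-id sets of every query term's postings.
def and_query(ix, terms):
    doc_sets = [{doc for doc, _ in ix.get(t, [])} for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(and_query(ix, ["un", "atto"]))   # {'D3'}, matching the slide's result
```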

107 Stemming Reduce terms to their roots before indexing; this also reduces the size of the inverted index. Stemming usually means crude, language-dependent suffix chopping, e.g. automate(s), automatic, automation are all reduced to automat. For example, "compressed" and "compression" are both accepted as equivalent to "compress"; after stemming, that sentence reads: "for exampl compress and compress ar both accept as equival to compress".
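A toy suffix-chopping stemmer in the spirit of the slide (deliberately crude, not Porter's algorithm; the suffix list is an assumption):

```python
# Crude suffix chopping: strip a few common English suffixes before indexing.
def crude_stem(term):
    for suffix in ("ation", "ions", "ion", "ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

print(crude_stem("compressed"), crude_stem("compression"))  # compress compress
```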

108 Stop words With a stop list, you exclude the commonest words from the dictionary entirely. Intuition: they have little semantic content (the, a, and, to, be) and they are very frequent (~30% of postings for the top 30 words). Dropping them significantly reduces the size of an inverted index and allows query optimizations, because the machine representation of a query won't include such words. But the trend is away from doing this: you need stop words for sentence/phrase queries ("King of Denmark"), various song titles ("Let it be", "To be or not to be"), and relational queries ("flights to London").
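A minimal sketch of stop-word removal at indexing time; the stop list here is a tiny illustrative assumption:

```python
# Drop the commonest terms before they reach the inverted index.
STOP_WORDS = {"the", "a", "and", "to", "be", "of"}

def remove_stop_words(terms):
    return [t for t in terms if t not in STOP_WORDS]

print(remove_stop_words("flights to london".split()))  # ['flights', 'london']
```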

109 Inverted Index: queries (2) Return documents containing the phrase un atto. a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13). Relevant postings: atto (D3,10); un (D1,16), (D3,2), (D3,6), (D3,13). Candidate position pairs in D3: (2,10), (6,10), (13,10): no occurrence of un is immediately followed by atto. 109

110 Inverted Index: queries (3) Return documents containing the phrase un animale. a (D2,5) animale (D3,14) atto (D3,10) boa (D3,8) casa (D2,6) che (D2,2) da (D1,4) Gregor (D1,7) in (D1,15) un (D1,16), (D3,2), (D3,6), (D3,13). Relevant postings: animale (D3,14); un (D1,16), (D3,2), (D3,6), (D3,13). Candidate position pairs in D3: (2,14), (6,14), (13,14); positions 13 and 14 are adjacent, so D3 matches: vidi un magnifico disegno. Rappresentava un serpente boa nell'atto di inghiottire un animale 110
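A minimal sketch of the positional phrase queries evaluated in the last two slides, reusing the index built earlier: a document matches the phrase "t1 t2" when some occurrence of t2 sits at the position immediately after an occurrence of t1.

```python
# Phrase query on a positional inverted index: t1 must be directly followed by t2.
def phrase_query(ix, t1, t2):
    pos1 = {}
    for doc, p in ix.get(t1, []):
        pos1.setdefault(doc, set()).add(p)
    return {doc for doc, p in ix.get(t2, []) if p - 1 in pos1.get(doc, set())}

print(phrase_query(ix, "un", "atto"))     # set(): no "un" right before "atto"
print(phrase_query(ix, "un", "animale"))  # {'D3'}: positions 13 and 14 are adjacent
```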
