Data Structure. Mohsen Arab. Yazd University. January 13, 2015.



Table of Contents: Binary Search Trees, Treaps, Skip Lists, Hash Tables.

Fundamental Data-structuring Problem

The fundamental data-structuring problem: maintain a collection {S_1, S_2, ...} of sets of items so as to efficiently support certain types of queries and operations:

MAKESET(S): create a new (empty) set S.
INSERT(i, S): insert item i into the set S.
DELETE(k, S): delete the item indexed by the key value k from the set S.
FIND(k, S): return the item indexed by the key value k in the set S.
JOIN(S_1, i, S_2): replace the sets S_1 and S_2 by the new set S = S_1 ∪ {i} ∪ S_2, where
  1. for all items j ∈ S_1, k(j) < k(i);
  2. for all items j ∈ S_2, k(j) > k(i).

Fundamental Data-structuring Problem (cont.)

PASTE(S_1, S_2): replace the sets S_1 and S_2 by the new set S = S_1 ∪ S_2, where for all items i ∈ S_1 and j ∈ S_2, k(i) < k(j).
SPLIT(k, S): replace the set S by the new sets
  S_1 = {j ∈ S | k(j) < k} and
  S_2 = {j ∈ S | k(j) > k}.

Binary Search Tree

A binary search tree (BST) is a binary tree in which the keys satisfy the search tree property.

Definition (search tree property): for every node with key value k, the left sub-tree contains only key values smaller than k, and the right sub-tree contains only key values larger than k.

The key values in a binary tree are in symmetric order if they satisfy the search tree property.

We will assume BSTs are endogenous.

Definition (endogenous): all key values are stored at internal nodes, and all leaf nodes are empty. This ensures that the trees are full, i.e. every non-leaf (internal) node has exactly two children.

Standard implementations of operations

MAKESET(S): initialize an empty tree for the set S.
JOIN(S_1, k, S_2): create a node containing key k as the root, and make S_1 and S_2 its left and right sub-trees, respectively.

Search. Example: FIND(4, S).

INSERT(k, S): perform FIND(k, S) and insert k where the search fails (into the empty leaf node).

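A minimal sketch of this FIND-then-insert scheme (my own illustrative Python, not the slides' code; it uses an ordinary exogenous node class rather than the endogenous empty-leaf representation):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, k):
    """Insert key k by descending as in FIND; the new node is placed
    exactly where the search fails. Returns the (possibly new) root."""
    if root is None:              # the "empty leaf" where the search fails
        return Node(k)
    if k < root.key:
        root.left = insert(root.left, k)
    elif k > root.key:
        root.right = insert(root.right, k)
    return root                   # duplicate keys are ignored

def find(root, k):
    """Return the node holding k, or None if the search fails."""
    while root is not None and root.key != k:
        root = root.left if k < root.key else root.right
    return root
```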

Implementation of operations: DELETE(k, S)

Case 1: the node v containing k has a leaf as one of its two children. For example, if the right child of v is a leaf, replace v by its left child L(v) as the child of its parent P(v).


Case 2: if neither child of v is a leaf, let k' be the key value that is the predecessor of k in the set S. The node containing k' can be deleted, since its right child is a leaf; then replace the key value k by k' in the node v.

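The two deletion cases can be sketched as follows (an illustrative Python version of my own, using an ordinary node class; in this representation the slides' "leaf child" corresponds to a None child):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def delete(root, k):
    """Delete key k from the BST rooted at root; returns the new root.
    Case 1: one child is empty, so the node is simply spliced out.
    Case 2: two children, so k is replaced by its predecessor k'
            (the maximum of the left sub-tree), and k' is deleted."""
    if root is None:
        return None
    if k < root.key:
        root.left = delete(root.left, k)
    elif k > root.key:
        root.right = delete(root.right, k)
    else:
        if root.left is None:        # Case 1
            return root.right
        if root.right is None:       # Case 1, mirror image
            return root.left
        pred = root.left             # Case 2: find the predecessor
        while pred.right is not None:
            pred = pred.right
        root.key = pred.key
        root.left = delete(root.left, pred.key)
    return root
```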

Implementation of operations (cont.)

PASTE(S_1, S_2):
  1. delete the largest key value, say k, from S_1;
  2. apply JOIN(S_1, k, S_2).
The key k can be found by doing FIND(+∞, S_1).

SPLIT(k, S): if k is at the root of S, do the reverse of the steps employed in JOIN(S_1, k, S_2); otherwise, use rotations to move k to the root first.

Problem: each operation can be performed in time proportional to the height of the tree, and there are sequences of INSERT operations that produce a tree of height linear in n.
Solution: perform rotations during update operations to keep all leaves at distance O(log n) from the root.

Rotations. Each type of rotation moves a node, together with one of its sub-trees, closer to the root (and moves some others away from the root), while preserving the search tree property.
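The two rotation primitives can be sketched as follows (my own naming convention; a right rotation at y promotes its left child x toward the root, and a left rotation is the mirror image):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(y):
    """Promote y's left child x; x's right sub-tree moves under y.
    The search tree property is preserved."""
    x = y.left
    y.left, x.right = x.right, y
    return x                      # x is the new sub-tree root

def rotate_left(x):
    """Mirror image: promote x's right child y."""
    y = x.right
    x.right, y.left = y.left, x
    return y
```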

A different strategy: splaying in self-adjusting search trees

Splaying: the splay operation moves a specified node to the root via a sequence of rotations.

Amortization: partitioning the total cost of a sequence of operations among the individual operations in that sequence. An amortized time bound can thus be viewed as the average cost of the operations in a sequence.

The idea behind self-adjusting trees: use a particular implementation of the splay operation to move to the root any node accessed by a FIND operation.

How this benefits us: nodes that are accessed often enough remain close to the root, so the total running time does not increase by much; for an infrequently accessed node, the total running time does not increase by much in any case.

Note: these self-adjusting trees guarantee only amortized logarithmic time per operation.

Advantages and drawbacks of self-adjusting trees

Advantages:
- They are relatively simple to implement.
- They do not require explicit balance information to be stored at the nodes.
- Splay trees can be shown to be optimal with respect to arbitrary access frequencies for the items being stored.

Drawbacks:
- They restructure the entire tree during updates and even during simple search operations.
- During any given operation, splay trees may perform a logarithmic number of rotations.
- There is no guarantee that every individual operation will run quickly.

Treaps

Treaps are an efficient randomized alternative to balanced trees and self-adjusting trees. They achieve essentially the same time bounds in the expected sense, but with the following advantages:
1. They do not require any explicit balance information.
2. The expected number of rotations performed per operation is small.
3. They are extremely simple to implement.

Binary search tree: a (full, endogenous) binary tree whose nodes have key values associated with them is a binary search tree if the key values are in symmetric order.

Heap: if the key values decrease monotonically along every root-to-leaf path, we call the structure a heap and say that the keys are stored in heap order.

Treap: consider a binary tree in which each node v contains a pair of values: a key k(v) and a priority p(v). We call this structure a treap if it is a binary search tree with respect to the key values and, simultaneously, a heap with respect to the priorities.

Example of a treap: S = {(k_1, p_1), ..., (k_n, p_n)} = {(2, 13), (4, 26), (6, 19), (7, 30), (9, 14), (11, 27), (12, 22)}.

Theorem 8.1: Let S = {(k_1, p_1), ..., (k_n, p_n)} be any set of key-priority pairs such that the keys and the priorities are distinct. Then there exists a unique treap T(S) for it.

Proof: The theorem is clearly true for n = 0 and n = 1. Suppose now that n ≥ 2, and assume that (k_1, p_1) has the highest priority in S. Then a treap for S can be constructed by putting item 1 at the root of T(S). A treap for the items in S with key value smaller (larger) than k_1 can be constructed recursively and stored as the left (right) sub-tree of item 1.

Implementation of operations using a treap

MAKESET(S) and FIND(k, S): exactly as before.
INSERT(k, S): do FIND(k, S) and insert k at the empty leaf node where the search terminates with failure. If the heap order is violated, i.e. p(parent(k)) < p(k), then repeat: decrease k's depth by performing a rotation at the node w = parent(k), so that k becomes the parent of w, until k becomes the root or p(parent(k)) > p(k).
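This insert-then-rotate-up procedure can be sketched as follows (an illustrative Python version of my own, where the explicit priority argument is only for reproducing examples; in a randomized treap each priority is drawn at random on insertion):

```python
import random

class TNode:
    def __init__(self, key, prio=None):
        self.key = key
        self.prio = random.random() if prio is None else prio
        self.left = self.right = None

def treap_insert(root, key, prio=None):
    """BST insert at a leaf, then rotate upward while the heap order
    on priorities is violated (larger priority = closer to the root)."""
    if root is None:
        return TNode(key, prio)
    if key < root.key:
        root.left = treap_insert(root.left, key, prio)
        if root.left.prio > root.prio:      # heap violation: rotate right
            root = rotate_right(root)
    else:
        root.right = treap_insert(root.right, key, prio)
        if root.right.prio > root.prio:     # heap violation: rotate left
            root = rotate_left(root)
    return root

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y
```

By Theorem 8.1 the resulting treap is independent of the insertion order, so inserting the slides' example pairs in any order yields the same tree with key 7 (priority 30) at the root.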

Implementation of operations using a treap: Add(), example.



DELETE(k, S): exactly the reverse of an insertion: rotate the node containing k downward until both its children are leaves, and then simply discard the node.
Note: the choice of rotation (left or right) at each stage depends on the relative order of the priorities of the children of the node being deleted.

Delete(), example.




JOIN(S_1, k, S_2): as before; the resulting structure is a treap provided the priority of k is higher than that of any item in S_1 or S_2. If the new root (containing k) violates the heap order, we simply rotate that node downward until each of its two children either has a smaller priority or is a leaf.
PASTE(S_1, S_2): as in a BST.
SPLIT(k, S):
  1. delete k from S;
  2. reinsert k into S with a priority of +∞, so that it becomes the root, whose sub-trees are S_1 and S_2.

Left spine of a tree: the path obtained by starting at the root and repeatedly moving to the left child until a leaf is reached; the right spine is defined similarly.


Mulmuley Games

Mulmuley games are useful abstractions of the processes underlying the behavior of certain geometric algorithms. The cast of characters in these games:
  players P = {P_1, ..., P_p},
  stoppers S = {S_1, ..., S_s},
  triggers T = {T_1, ..., T_t},
  bystanders B = {B_1, ..., B_b}.
The set P ∪ S is drawn from a totally ordered universe, and all players are smaller than all stoppers: for all i and j, P_i < S_j.

Exercise 8.5: Let H_k = Σ_{i=1}^k 1/i denote the k-th Harmonic number. Show that
  Σ_{k=1}^n H_k = (n + 1) H_{n+1} - (n + 1).
Recall that H_k = ln k + O(1) (Proposition B.4).
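The identity in Exercise 8.5 can be checked exactly with rational arithmetic (a quick verification sketch of my own, not part of the slides):

```python
from fractions import Fraction

def harmonic(k):
    """Exact k-th Harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(Fraction(1, i) for i in range(1, k + 1))

def identity_holds(n):
    """Check sum_{k=1}^n H_k == (n+1) H_{n+1} - (n+1) exactly."""
    lhs = sum(harmonic(k) for k in range(1, n + 1))
    rhs = (n + 1) * harmonic(n + 1) - (n + 1)
    return lhs == rhs
```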

Depending upon the set of active characters, we formulate four different games, each more general than the previous one.

Game A: the initial set of characters is X = P ∪ B. The game proceeds by repeatedly sampling from X without replacement, until X becomes empty. The random variable V is the number of samples in which a player P_i is chosen such that P_i is larger than all previously chosen players. The value of the game is A_p = E[V].

Lemma 8.2: For all p ≥ 0, A_p = H_p.

Proof: Assume that the set of players is ordered as P_1 > P_2 > ... > P_p. In Game A the bystanders do not affect V, so we can set b = 0. If the first chosen player is P_i, the expected value of the game is 1 + A_{i-1}. Hence

  A_p = Σ_{i=1}^p (1 + A_{i-1}) / p = 1 + (1/p) Σ_{i=1}^p A_{i-1}.

Upon rearrangement, using the fact that A_0 = 0, we obtain

  Σ_{i=1}^{p-1} A_i = p A_p - p.

By Exercise 8.5, the Harmonic numbers are the solution to this recurrence.
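Lemma 8.2 is easy to sanity-check by simulation: V is exactly the number of left-to-right maxima ("records") in a uniformly random permutation of the players. A small Monte Carlo sketch of my own (trial count and seed are arbitrary choices):

```python
import random
from fractions import Fraction

def game_a_value(p, trials=20000, seed=1):
    """Monte Carlo estimate of A_p: repeatedly sample the players
    without replacement and count those larger than all previously
    chosen players (bystanders are irrelevant, so b = 0)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        best = -1
        for v in rng.sample(range(p), p):
            if v > best:          # a new record: contributes 1 to V
                best = v
                total += 1
    return total / trials

def harmonic(k):
    return float(sum(Fraction(1, i) for i in range(1, k + 1)))
```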

Game C: the initial set of characters is X = P ∪ B ∪ S. The stoppers are treated as players, but the game stops when a stopper is chosen for the first time. The value of the game is C_p^s = E[V + 1] = E[V] + 1.

Note: since all players are smaller than all stoppers, we always get a contribution of 1 to the game value from the first stopper.

Lemma 8.3: For all p, s ≥ 0, C_p^s = 1 + H_{s+p} - H_s.

Proof: Assume that the set of players is ordered as P_1 > P_2 > ... > P_p. As in Game A, the bystanders do not matter, so we can set b = 0.
- If the first sample is P_i, an event of probability 1/(s + p), the expected game value is 1 + C_{i-1}^s.
- If the first sample is a stopper, an event of probability s/(s + p), the game value is 1.

Proof of Lemma 8.3 (cont.):

  C_p^s = (s/(s+p)) · 1 + (1/(s+p)) Σ_{i=1}^p (1 + C_{i-1}^s).

Upon rearrangement, using the fact that C_0^s = 1, we obtain

  C_p^s = (s+p+1)/(s+p) + (1/(s+p)) Σ_{i=1}^{p-1} C_i^s,

which is equivalent to

  Σ_{i=1}^{p-1} C_i^s = (s+p) C_p^s - (s+p+1).

Once again, using Exercise 8.5 it can be verified that the solution to this recurrence is C_p^s = 1 + H_{s+p} - H_s.

Games D and E: similar to Games A and C, except that in Game D, X = P ∪ B ∪ T, and in Game E, X = P ∪ B ∪ S ∪ T. The role of the triggers is that the counting process begins only after the first trigger has been chosen: a player or stopper contributes to V only if it is sampled after a trigger and before any stopper (and, of course, only if it is larger than all previously chosen players).

Lemma 8.4: For all p, t ≥ 0, D_p^t = H_p + H_t - H_{p+t}.
Lemma 8.5: For all p, s, t ≥ 0, E_p^{s,t} = t/(s+t) + (H_{s+p} - H_s) - (H_{s+p+t} - H_{s+t}).

Analysis of Treaps

Memoryless property: since the random priorities for the elements of S are chosen independently, we can assume that the priorities are chosen before the insertion process is initiated. Once the priorities have been fixed, Theorem 8.1 implies that the treap T is uniquely determined, so the order in which the elements are inserted does not affect the structure of the tree. Without loss of generality, we can therefore assume that the elements of S are inserted into T in order of decreasing priority. An advantage of this view is that all insertions then take place at the leaves, and no rotations are needed to restore the heap order on the priorities.

Lemma 8.6: Let T be a random treap for a set S of size n. For an element x ∈ S having rank k,
  E[depth(x)] = H_k + H_{n-k+1} - 1.

Idea of proof: Let S^- = {y ∈ S | y ≤ x} and S^+ = {y ∈ S | y ≥ x}. Since x has rank k, |S^-| = k and |S^+| = n - k + 1. Let Q_x ⊆ S denote the set of ancestors of x (including x itself), and set Q_x^- = S^- ∩ Q_x and Q_x^+ = S^+ ∩ Q_x. We will establish that E[|Q_x^-|] = H_k; by symmetry, it follows that E[|Q_x^+|] = H_{n-k+1}, and since x is counted in both sets, E[depth(x)] = H_k + H_{n-k+1} - 1.


Consider any ancestor y ∈ Q_x^- of the node x. By the memoryless assumption, y must have been inserted prior to x, i.e. p_y > p_x. Since y < x, x must lie in the right sub-tree of y. The search for every element z whose value lies between y and x (y < z < x) must follow the path from the root to y and, in fact, go into the right sub-tree of y. We conclude that y is an ancestor of every node containing an element of value between y and x. By our assumption, any such z must have been inserted after y, and hence has lower priority than y.

Continuation of the proof: The preceding argument establishes that an element y ∈ S^- is an ancestor of x, i.e. a member of Q_x^-, if and only if it was the largest element of S^- present in the treap at the time of its insertion. The order of insertion is determined by the order of the priorities, and the latter is uniformly distributed; thus, the order of insertion can be viewed as determined by uniform sampling without replacement from the pool S. We can now claim that the distribution of |Q_x^-| is the same as that of the value of Game A with P = S^- and B = S \ S^-. Since |S^-| = k, the expected size of Q_x^- is H_k.

For any element x in a treap, let
  L_x = the length of the left spine of the right sub-tree of x,
  R_x = the length of the right spine of the left sub-tree of x.

Lemma 8.7: Let T be a random treap for a set S of size n. For an element x ∈ S of rank k,
  E[R_x] = 1 - 1/k,  E[L_x] = 1 - 1/(n - k + 1).

Proof: We claim that (1) an element z < x lies on the right spine of the left sub-tree of x if and only if (2) z was inserted after x, and all elements y whose values lie between z and x (z < y < x) were inserted after z.

Proof of (2) ⇒ (1): suppose z was inserted after x, and all elements y with z < y < x were inserted after z; we show that z lies on the right spine of the left sub-tree of x.
(a) If x is an ancestor of z but z does not lie on the right spine of the left sub-tree of x, then some node u (or v) on the path satisfies z < u < x (or z < v < x), and since u (or v) is an ancestor of z, it was inserted before z (contradiction).
(b) If x is not an ancestor of z, let w be the lowest common ancestor of z and x. Then z < w < x, and since w is an ancestor of z, it must have been inserted before z (contradiction).

Proof of (1) ⇒ (2): suppose an element z < x lies on the right spine of the left sub-tree of x; we show that z was inserted after x, and that all elements y with z < y < x were inserted after z. Since x is an ancestor of z, x was inserted before z. Also, since every element y with z < y < x must be inserted into the right sub-tree of z, all such y were inserted after z.


Search in Skip Lists

We search for a key x in a skip list as follows:
- Start at the first position of the top list.
- At the current position p, compare x with y = key(next(p)):
  - x = y: return element(next(p)).
  - x > y: scan forward.
  - x < y: drop down.
Example: search for 78.
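The scan-forward / drop-down walk can be sketched as follows (an illustrative Python implementation of my own, not the slides' code; the node layout, `max_level` cap, and the p = 1/2 coin are assumed details, and `insert` is included only so the structure can be built):

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level          # one forward pointer per level

class SkipList:
    def __init__(self, max_level=16, seed=0):
        self.head = SkipNode(float("-inf"), max_level)  # -inf sentinel
        self.max_level = max_level
        self.rng = random.Random(seed)

    def _random_level(self):
        """Geometric level with parameter 1/2, capped at max_level."""
        lvl = 1
        while self.rng.random() < 0.5 and lvl < self.max_level:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.max_level
        node = self.head
        for i in reversed(range(self.max_level)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
            update[i] = node                # last node before key on level i
        new = SkipNode(key, self._random_level())
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def find(self, key):
        node = self.head
        for i in reversed(range(self.max_level)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]         # scan forward
            # key(next) >= x here: drop down one level
        node = node.next[0]
        return node is not None and node.key == key
```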

Tree representation of a skip list.


Analyzing Random Skip Lists

A random leveling of the set S is defined as follows: given the choice of level L_i, the level L_{i+1} is defined by independently choosing to retain each element x ∈ L_i with probability 1/2. The process starts with L_1 = S and terminates when a newly constructed level is empty.

Alternate view: let the levels l(x) for x ∈ S be independent random variables, each with the geometric distribution with parameter p = 1/2. Let r = max_{x ∈ S} l(x) + 1, and place x in each of the levels L_1, ..., L_{l(x)}. As with random treaps, a random level is chosen for every element of S upon its insertion and remains fixed until the element is deleted.
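The alternate view can be sketched directly (an illustrative Python snippet of my own, drawing each l(x) as a geometric variable and materializing the nested levels):

```python
import random

def random_leveling(S, seed=0):
    """Draw l(x) ~ Geometric(1/2) independently for each x in S,
    then place x in levels L_1 .. L_{l(x)}. Returns [L_1, L_2, ...]."""
    rng = random.Random(seed)
    level = {}
    for x in S:
        l = 1
        while rng.random() < 0.5:   # flip coins until the first failure
            l += 1
        level[x] = l
    top = max(level.values()) if level else 0
    return [[x for x in S if level[x] >= i] for i in range(1, top + 1)]
```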

Lemma 8.9: The number of levels r in a random leveling of a set S of size n has expected value E[r] = O(log n). Moreover, r = O(log n) with high probability.

Proof: r = max_{x ∈ S} l(x) + 1, where the levels l(x) are i.i.d. random variables, each geometrically distributed with parameter p = 1/2. Since Pr[max_i X_i > t] ≤ n(1 - p)^t = n · 2^{-t}, choosing t = α log n gives
  Pr[r > α log n] ≤ 1/n^{α-1}
for any α > 1.

Toward Lemma 8.10, define I_j(y) as the interval at level j that contains y, and, for an interval I at level i + 1, let c(I) denote the number of children it has at level i.

Hash Tables

1. Static dictionary: we are given a set of keys S and must organize it into a data structure that supports efficient processing of FIND queries.
2. Dynamic dictionary: the set S is not provided in advance; instead, it is constructed by a series of INSERT and DELETE operations intermingled with FIND queries.

The data-structuring problem: all data structures discussed earlier require Ω(log n) time to process any search or update operation. These time bounds are optimal for data structures based on pointers and search trees, so we face a logarithmic lower bound. The bounds rest on the fact that the only computation we can perform on keys is to compare them and thereby determine their relationship in the underlying total order.

Suppose the keys in S are chosen from a totally ordered universe M of size m, with M = {0, ..., m - 1} w.l.o.g., and that the keys are distinct.

The idea: create an array T[0..m-1] of size m in which
  T[k] = 1 if k ∈ S,
  T[k] = NULL otherwise.
This is called a direct-address table. Operations take O(1) time. So what's the problem?
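A direct-address table is a one-line idea; this small sketch (my own illustrative class) shows that each operation is a single array access:

```python
class DirectAddressTable:
    def __init__(self, m):
        # one slot per possible key in M = {0, ..., m-1};
        # note the O(m) cost of this initialization
        self.slots = [None] * m

    def insert(self, k):
        self.slots[k] = True       # O(1): single array write

    def delete(self, k):
        self.slots[k] = None       # O(1)

    def find(self, k):
        return self.slots[k] is not None   # O(1)
```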

Direct addressing works well when the range m of keys is relatively small. But what if the keys are 32-bit integers?
Problem 1: the direct-address table will have 2^32 entries, more than 4 billion.
Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be.
We want to reduce the size of the table to a value close to |S|, while maintaining the property that a search or update can be performed in O(1) time.

Setup: a table T consisting of n cells indexed by N = {0, ..., n - 1}, and a hash function h, which is a mapping from M into N. We assume n < m; otherwise a direct-address table can be used. A collision occurs when two distinct keys x and y map to the same location, i.e. h(x) = h(y).
Goal: maintain a small table, and use the hash function h to map keys into it. If h behaves randomly, we shouldn't get too many collisions.

Hash Tables: Chaining. Chaining puts elements that collide in a linked list.

Universal Hash Families

Definition (2-universal): Let M = {0, ..., m - 1} and N = {0, ..., n - 1}, with m ≥ n. A family H of functions from M into N is said to be 2-universal if, for all x, y ∈ M such that x ≠ y, and for h chosen uniformly at random from H,
  Pr[h(x) = h(y)] ≤ 1/n.

Define the following indicator function for a collision between the keys x and y under the hash function h:
  δ(x, y, h) = 1 if h(x) = h(y) and x ≠ y, and 0 otherwise.
For all X, Y ⊆ M, define the following extensions of the indicator function δ:
  δ(x, y, H) = Σ_{h ∈ H} δ(x, y, h),
  δ(x, Y, h) = Σ_{y ∈ Y} δ(x, y, h),
  δ(X, Y, h) = Σ_{x ∈ X} δ(x, Y, h),
  δ(x, Y, H) = Σ_{y ∈ Y} δ(x, y, H),
  δ(X, Y, H) = Σ_{h ∈ H} δ(X, Y, h).

Note: for a 2-universal family H and any x ≠ y, we have δ(x, y, H) ≤ |H|/n.

Theorem 8.12: For any family H of functions from M to N, there exist x, y ∈ M such that
  δ(x, y, H) > |H|/n - |H|/m.

Proof of Theorem 8.12: Fix some function h ∈ H, and for each z ∈ N define the set of elements of M mapped to z as A_z = {x ∈ M | h(x) = z}. The sets A_z, for z ∈ N, form a partition of M. It is easy to verify that
  δ(A_w, A_z, h) = 0 for w ≠ z, and δ(A_z, A_z, h) = |A_z|(|A_z| - 1).
The total number of collisions between all possible pairs of elements is minimized when the sets A_z are all of the same size, m/n. We obtain
  δ(M, M, h) = Σ_{z ∈ N} |A_z|(|A_z| - 1) ≥ n · (m/n)(m/n - 1) = m^2 (1/n - 1/m).

Proof (cont.):
  δ(M, M, H) = Σ_{h ∈ H} δ(M, M, h) ≥ |H| · m^2 (1/n - 1/m).
By the pigeonhole principle, since there are fewer than m^2 ordered pairs x ≠ y, there exist x, y ∈ M such that
  δ(x, y, H) > δ(M, M, H) / m^2 ≥ |H| · m^2 (1/n - 1/m) / m^2 = |H| (1/n - 1/m).

Lemma 8.13: For all x ∈ M, S ⊆ M, and h chosen uniformly at random from a 2-universal family H,
  E[δ(x, S, h)] ≤ |S|/n.

Proof:
  E[δ(x, S, h)] = (1/|H|) Σ_{h ∈ H} δ(x, S, h)
                = (1/|H|) Σ_{h ∈ H} Σ_{y ∈ S} δ(x, y, h)
                = (1/|H|) Σ_{y ∈ S} Σ_{h ∈ H} δ(x, y, h)
                = (1/|H|) Σ_{y ∈ S} δ(x, y, H)
                ≤ (1/|H|) Σ_{y ∈ S} |H|/n
                = |S|/n.

Notes on our dynamic dictionary scheme:

A hash function h ∈ H is chosen uniformly at random, and remains fixed during the entire sequence of updates and queries.
An inserted key x is stored at the location h(x); due to collisions, other keys may also be stored at that location.
The keys colliding at a given location are organized into a linked list.
Assuming that the set of keys currently stored in the table is S ⊆ M, the length of the linked list searched for a key x is δ(x, S, h), which has expectation at most |S|/n.
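A minimal sketch of this dynamic dictionary (the class and method names are ours; the hash function is drawn from the ((ax + b) mod p) mod n family constructed on the next slides, with a large prime of our choosing):

```python
import random

class ChainedHashTable:
    """Sketch of the dynamic dictionary: one hash function h, fixed up
    front, with collisions resolved by per-slot chains (Python lists
    standing in for linked lists). Illustrative code, not from the text."""

    def __init__(self, n, p=2_147_483_647):    # p: a prime >= m (2**31 - 1)
        self.n, self.p = n, p
        self.a = random.randrange(1, p)        # h is chosen once, at random,
        self.b = random.randrange(p)           # and fixed for all operations
        self.slots = [[] for _ in range(n)]

    def _h(self, x):
        return (self.a * x + self.b) % self.p % self.n

    def insert(self, x):
        chain = self.slots[self._h(x)]
        if x not in chain:
            chain.append(x)

    def delete(self, x):
        chain = self.slots[self._h(x)]
        if x in chain:
            chain.remove(x)

    def find(self, x):
        # cost: the chain length delta(x, S, h), expected at most |S|/n
        return x in self.slots[self._h(x)]
```

Each operation costs O(1) plus the length of one chain, which Lemma 8.13 bounds in expectation by |S|/n.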

Theorem 8.14: Consider a request sequence R = R₁, R₂, ..., R_r of update and search operations, starting with an empty hash table, and suppose that this sequence contains s INSERT operations. Let ρ(h, R) denote the total cost of processing these requests using the hash function h ∈ H.

Theorem 8.14: For any sequence R of length r with s INSERTs, and h chosen uniformly at random from a 2-universal family H,

E[ρ(h, R)] ≤ r(1 + s/n)

Constructing Universal Hash Families

Fix m and n. Choose a prime p ≥ m. We will work over the field Z_p = {0, 1, ..., p − 1}. Let g : Z_p → N be the function given by g(x) = x mod n. For all a, b ∈ Z_p, define the linear function f_{a,b} : Z_p → Z_p and the hash function h_{a,b} : Z_p → N as follows:

f_{a,b}(x) = (ax + b) mod p
h_{a,b}(x) = g(f_{a,b}(x)) = ((ax + b) mod p) mod n

We consider the family of hash functions

H = {h_{a,b} | a, b ∈ Z_p with a ≠ 0}

Lemma 8.15: For all x, y ∈ Z_p such that x ≠ y,

δ(x, y, H) = δ(Z_p, Z_p, g)

Proof

Suppose that x and y collide under a specific function h_{a,b}. Let f_{a,b}(x) = r and f_{a,b}(y) = s, and observe that r ≠ s, since a ≠ 0 and x ≠ y. A collision takes place if and only if g(r) = g(s), or equivalently, r ≡ s (mod n).

Now, having fixed x and y, for each such choice of r ≠ s with r ≡ s (mod n), the values of a and b are uniquely determined as the solution of the system:

ax + b ≡ r (mod p)
ay + b ≡ s (mod p)

Thus the collisions of the pair (x, y) over H are in one-to-one correspondence with the colliding pairs (r, s) under g.

Theorem 8.16: The family H = {h_{a,b} | a, b ∈ Z_p with a ≠ 0} is a 2-universal family.

Proof: For each z ∈ N, let A_z = {x ∈ Z_p | g(x) = z}; it is clear that |A_z| ≤ ⌈p/n⌉. In other words, for every r ∈ Z_p there are at most ⌈p/n⌉ − 1 choices of s ≠ r with g(s) = g(r). Since there are p choices of r ∈ Z_p to start with,

δ(Z_p, Z_p, g) ≤ p(⌈p/n⌉ − 1) ≤ p(p − 1)/n

By Lemma 8.15, δ(x, y, H) = δ(Z_p, Z_p, g) ≤ p(p − 1)/n. Since |H| = p(p − 1), it follows that

δ(x, y, H) ≤ |H|/n
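For toy parameters the theorem can be verified exhaustively, by enumerating every h_{a,b} with a ≠ 0 and counting how often one fixed pair collides (p, n and the pair x, y below are our illustrative choices):

```python
p, n = 13, 4                      # toy prime p and range size n
x, y = 3, 7                       # a fixed pair with x != y

def h(a, b, z):
    return ((a * z + b) % p) % n

family_size = (p - 1) * p         # |H| = p(p - 1)
collisions = sum(1 for a in range(1, p) for b in range(p)
                 if h(a, b, x) == h(a, b, y))

# delta(x, y, H) <= |H|/n, as Theorem 8.16 asserts
assert collisions <= family_size / n
```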

Definition 8.6: Let M = {0, 1, ..., m − 1} and N = {0, 1, ..., n − 1}, with m ≥ n. A family H of functions from M into N is said to be strongly 2-universal if, for all x₁ ≠ x₂ ∈ M, any y₁, y₂ ∈ N, and h chosen uniformly at random from H,

Pr[h(x₁) = y₁ and h(x₂) = y₂] = 1/n²

Definition 8.7: A family of hash functions H = {h : M → N} is said to be a perfect hash family if for each set S ⊆ M of size s < n there exists a hash function h ∈ H that is perfect for S (i.e., maps the keys of S without collisions).

Note: It is clear that perfect hash families exist: for example, the family of all possible functions from M to N is a perfect hash family.

Given a perfect hash family H, we solve the static dictionary problem by:
1 finding h ∈ H perfect for S;
2 storing each key x ∈ S at the location T[h(x)];
3 responding to a search query for a key q by examining the contents of T[h(q)].

The preprocessing cost depends on the cost of identifying a perfect hash function for the specific choice of S. The search cost depends on the time required to evaluate the hash function.

Since the choice of the hash function depends on the set S, its description must also be stored in the table. Suppose that the size of the perfect hash family H is r; storing the description of a hash function from H then requires Ω(log r) bits. It is essential that the description of the hash function fit into O(1) locations in the table T, and a cell in the table can be used to encode at most log m bits of information.

Note: Therefore, we will only be interested in constructing hash families whose size r is bounded by a polynomial in m.

Exercise 8.13: Assume for simplicity that n = s. Show that for m = 2^Ω(s), there exist perfect hash families of size polynomial in m. Thus, the existence of such a family is guaranteed only for values of m that are extremely large relative to n.

Exercise 8.14: Assuming that n = s, show that any perfect hash family must have size 2^Ω(s). Thus, we need m = 2^Ω(s), or s = O(log m), to guarantee even the existence of a perfect hash family of size polynomial in m.

Unfortunately, in practice the case s = O(log m) is not very interesting for typical values of m, e.g., m = 2³². Solution: double hashing.

Definition 8.8: Let S ⊆ M and h : M → N. For each table location 0 ≤ i ≤ n − 1, we define the bin

B_i(h, S) = {x ∈ S | h(x) = i}

The size of a bin is denoted by b_i(h, S) = |B_i(h, S)|.

Definition 8.9: A hash function h is b-perfect for S if b_i(h, S) ≤ b for each i. A family of hash functions H = {h : M → N} is said to be a b-perfect hash family if for each S ⊆ M of size s there exists a hash function h ∈ H that is b-perfect for S.

Exercise 8.15: Show that there exists a b-perfect hash family H such that b = O(log n) and |H| ≤ m, for any m ≥ n.

Double hashing: At the first level we use a (log m)-perfect hash function h to map S into the primary table T. Consider the bin B_i consisting of all keys from S mapped into a particular cell T[i]. The elements of B_i are mapped into the secondary table T_i associated with that location using a secondary hash function h_i. Since the size of B_i is bounded by b, we can find a hash function h_i that is perfect for B_i, provided 2^b is polynomially bounded in m. For b = O(log m) this condition holds.
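Putting the two levels together gives the following runnable sketch. It is our illustrative implementation, not the text's: the prime P, the function names, and the randomized retry loops (which stand in for the existence arguments that the coming slides prove via Corollary 8.18) are our choices; the hash form (kx mod p) mod r matches Definition 8.10 below.

```python
import random

P = 2_147_483_647        # a prime p >= m; our choice (2**31 - 1)

def h(k, x, r):
    """h_k(x) = (k*x mod p) mod r."""
    return (k * x) % P % r

def build(S):
    """Two-level static table for S: a primary table of size s = |S|,
    plus a secondary table of size b_i**2 for each primary bin."""
    s = len(S)
    while True:                                  # primary k: want the sum of
        k = random.randrange(1, P)               # squared bin sizes < 3s
        bins = [[] for _ in range(s)]
        for x in S:
            bins[h(k, x, s)].append(x)
        if sum(len(b) ** 2 for b in bins) < 3 * s:
            break
    secondary = []
    for b in bins:                               # per-bin perfect function
        r = len(b) ** 2
        if r == 0:
            secondary.append((0, 0, []))
            continue
        while True:
            ki = random.randrange(1, P)
            cells = [None] * r
            perfect = True
            for x in b:
                j = h(ki, x, r)
                if cells[j] is not None:         # collision in the bin: retry
                    perfect = False
                    break
                cells[j] = x
            if perfect:
                secondary.append((ki, r, cells))
                break
    return k, secondary

def find(table, x):
    """O(1) FIND: one primary probe, then one secondary probe."""
    k, secondary = table
    ki, r, cells = secondary[h(k, x, len(secondary))]
    return r > 0 and cells[h(ki, x, r)] == x
```

The threshold 3s mirrors the bound Σ b_i² < 3s derived in the proof of Theorem 8.19; in practice a random k satisfies it after few retries.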

The double hashing scheme can thus be implemented with O(1) query time, for any m ≥ n. The goal of the primary hash function should be to create bins small enough that perfect hash functions can be used as the secondary hash functions.

Exercise 8.16: Consider a table of size r indexed by R = {0, ..., r − 1}. Show that there exists a perfect hash family H = {h : M → R} with |H| ≤ m, provided that r = Ω(s²), for all m ≥ s.

Towards our final solution

We will use a primary table of size n = s, choosing a primary hash function that ensures that the bin sizes are small. The perfect hash functions from Exercise 8.16 are then used to resolve the collisions, using secondary hash tables of size quadratic in the bin sizes. The total space required by the double hashing scheme is

s + O(Σ_{i=0}^{s−1} b_i²)

Achieving Bounded Query Time

Our goals now are:
1 to find primary hash functions which ensure that the sum of the squares of the bin sizes is linear;
2 to find perfect hash functions for the secondary tables, which use at most quadratic space.

Definition 8.10: Consider any V ⊆ M with |V| = v, and let R = {0, ..., r − 1} with r ≥ v. For 1 ≤ k ≤ p − 1, define the function h_k : M → R as follows:

h_k(x) = (kx mod p) mod r

For each i ∈ R, the bin of keys colliding at i is denoted by

B_i(k, r, V) = {x ∈ V | h_k(x) = i}

and its size by b_i(k, r, V) = |B_i(k, r, V)|.

Lemma 8.17: For all V ⊆ M of size v, and all r ≥ v,

Σ_{k=1}^{p−1} Σ_{i=0}^{r−1} (b_i(k, r, V) choose 2) < (p − 1)v²/r ≤ mv²/r

Proof: The left-hand side counts the number of tuples (k, {x, y}) such that h_k causes x and y to collide, i.e.,

1 x, y ∈ V with x ≠ y, and
2 (kx mod p) mod r = (ky mod p) mod r.

For such a collision, the relation between k and x, y is

k(x − y) mod p ∈ {±r, ±2r, ±3r, ..., ±⌊(p − 1)/r⌋ r}

Proof (cont.)

Since p is a prime and Z_p is a field, for any fixed value of x − y there is a unique solution for k satisfying k(x − y) ≡ jr (mod p), for each value of j. This immediately implies that the number of values of k that cause a collision between x and y is at most 2⌊(p − 1)/r⌋ ≤ 2(p − 1)/r. Finally, noting that the number of choices of the pair {x, y} is (v choose 2), we obtain

Σ_{k=1}^{p−1} Σ_{i=0}^{r−1} (b_i(k, r, V) choose 2) ≤ (v choose 2) · 2(p − 1)/r < (p − 1)v²/r
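The count in Lemma 8.17 can be checked exhaustively on a toy instance (the prime p, the set V and the table size r below are our illustrative values):

```python
p = 13                       # toy prime
V = [1, 4, 6, 9, 11]         # v = 5 keys from M = {0, ..., p - 1}
v, r = len(V), 5

total = 0                    # number of colliding tuples (k, {x, y})
for k in range(1, p):
    bins = [0] * r
    for x in V:
        bins[(k * x) % p % r] += 1
    total += sum(b * (b - 1) // 2 for b in bins)   # sum of (b_i choose 2)

assert total < (p - 1) * v ** 2 / r                # the lemma's bound
```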

Corollary 8.18: For all V ⊆ M of size v, and all r ≥ v, there exists k ∈ {1, ..., m} such that

Σ_{i=0}^{r−1} (b_i(k, r, V) choose 2) < v²/r

Theorem 8.19: For any S ⊆ M with |S| = s and m ≥ s, there exists a hash table representation of S that uses space O(s) and permits the processing of a FIND operation in O(1) time.

Proof: The double hashing scheme is as described above; all that remains to be shown is that there are choices of the primary hash function h_k and the secondary hash functions h_{k_0}, ..., h_{k_{s−1}} that ensure the promised performance bounds.

Proof (cont.)

Consider first the primary hash function h_k. The only property desired of this function is that the sum of squares of the bin sizes be linear in s, to ensure that the space used by the secondary hash tables is O(s). Applying Corollary 8.18 to the case where V = S and R = T, so that v = r = s, we obtain that there exists a k ∈ {1, ..., m} such that

Σ_{i=0}^{s−1} (b_i(k, s, S) choose 2) < s,

or equivalently,

Σ_{i=0}^{s−1} b_i(k, s, S)[b_i(k, s, S) − 1] < 2s.

Since ∪_{i=0}^{s−1} B_i(k, s, S) = S and Σ_{i=0}^{s−1} b_i(k, s, S) = s,

Σ_{i=0}^{s−1} b_i(k, s, S)² < 2s + Σ_{i=0}^{s−1} b_i(k, s, S) = 3s

Consider now the secondary hash function h_{k_i} for the set S_i = B_i(k, s, S) of size s_i. Applying Corollary 8.18 to the case where V = S_i (so v = s_i) and using a secondary hash table of size r = s_i², it follows that there exists a k_i ∈ {1, ..., m} such that

Σ_{j=0}^{s_i²−1} (b_j(k_i, s_i², S_i) choose 2) < 1,

where b_j(k_i, s_i², S_i) is the number of keys colliding at the jth location of the secondary hash table for T[i]. This can be the case only when each term of the summation is zero, implying that b_j(k_i, s_i², S_i) ≤ 1 for all j. Thus, there exists a perfect secondary hash function h_{k_i}.