prev:[[61B-part-5-1-BST]]
next:

prev: BSTs

disadvantages:

  1. comparable: need to be able to ask “X<Y”questions
  2. O(logN) can be better

extreme approach: using data as the index

  • initial all values as false
  • when an item is added, the index is set to be true.
    drawback: extreme waste of memory
    suppose we want to add “cat”
    problems:
  • what is the catth element of the list? –use cas the index
  • other words started with c collides with cat, and cannot store words like =equals()
    solution: use the “power of 27”
    $$cat_{27} = 327^2+127^1+20*27^0=2234_{10}$$
    the algs will give each lowercase English word a unique number!
    ASCII Characters: >=127进制
    Chinese characters: >=40959进制
    However, the problem: integer overflow
    if biggest number in int, it +1 = smallest number!
    the code: Hashcode
    challenges:
  • how to handle hashcoode collisions?(collision handling_)
  • how can we compute a hash code for arbitrary objects? (computing a hashCode)
    solution: instead of storing true in position h, store a “bucket” oif these N items at the position h.
    The bucket implementation: linked list, arraylist,arrayset…

separate chaining data indexed array

![[Pasted image 20240421132226.png]]

the bucket h is empty –> create a new list containing x, store it at index h.
the bucket h is already a list –> add x to this list if it is not already present.
we can use the modulus of hashcode to reduce bucket count! (downside: lists will be longer)

improvement

suppose:

  • fixed number of buckets M
  • increasing number of N
    problem: even if the items are spread out evenly, the lists are of length Q=N/M, therefore the complexity is O(N).

example strategy

when N/M >= 1.5, double M
(N/M: “load factor”) how full the hast tabhle is.
Therefore, now we have:

  • increasing number of buckets M
  • increasing number of items N
    O(N/M) =O(1)

Runtime

One important thing to consider: resizing operation.

  • resizing takes O(N) time
  • add takes O(1)+ time (O(N) when resizing)
  • contains takes O(1)+ time on average.

Ubiquity

  • great performance in practice
  • implementations are simple
  • python dictionary are just hash tables.
  • used in HashMap and HashSet, not implements Hashable (all objects in Java must implement a .hashcode() method)

Negative hash codes

unfortunately, -1%4=-1 will result in index errors.
Therefore, use Math.floorMod instead.

Warning

  • never store objects that can change in HashSet or HashMap!
  • never override equals without override hashCode!

HashCode

goal: make the table bushy
java’s HashCode() for strings: 31-based
It seemed like that 126 is better. However, the overflow is a bad problem for base 126!
major collision problem for base 126: Any string that ends in the same last 32 characters has the same hash code!
$$126^{32}=126^{33}=126^{34}=…0$$
See 61C for more.
Typical hash code: a small prime.