61B part 5.2 hashing
prev:[[61B-part-5-1-BST]]
next:
prev: BSTs
disadvantages:
- comparable: need to be able to ask “X<Y”questions
- O(logN) can be better
extreme approach: using data as the index
- initial all values as false
- when an item is added, the index is set to be true.
drawback: extreme waste of memory
suppose we want to add “cat”
problems: - what is the
catth element of the list? –usecas the index - other words started with c collides with
cat, and cannot store words like=equals()…
solution: use the “power of 27”
$$cat_{27} = 327^2+127^1+20*27^0=2234_{10}$$
the algs will give each lowercase English word a unique number!
ASCII Characters: >=127进制
Chinese characters: >=40959进制
However, the problem: integer overflow
if biggest number in int, it +1 = smallest number!
the code: Hashcode
challenges: - how to handle hashcoode collisions?(collision handling_)
- how can we compute a hash code for arbitrary objects? (computing a hashCode)
solution: instead of storingtruein position h, store a “bucket” oif these N items at the position h.
The bucket implementation: linked list, arraylist,arrayset…
separate chaining data indexed array
![[Pasted image 20240421132226.png]]
the bucket h is empty –> create a new list containing x, store it at index h.
the bucket h is already a list –> add x to this list if it is not already present.
we can use the modulus of hashcode to reduce bucket count! (downside: lists will be longer)
improvement
suppose:
- fixed number of buckets M
- increasing number of N
problem: even if the items are spread out evenly, the lists are of length Q=N/M, therefore the complexity is O(N).
example strategy
when N/M >= 1.5, double M
(N/M: “load factor”) how full the hast tabhle is.
Therefore, now we have:
- increasing number of buckets M
- increasing number of items N
O(N/M) =O(1)
Runtime
One important thing to consider: resizing operation.
- resizing takes O(N) time
- add takes O(1)+ time (O(N) when resizing)
- contains takes O(1)+ time on average.
Ubiquity
- great performance in practice
- implementations are simple
- python dictionary are just hash tables.
- used in
HashMapandHashSet, notimplements Hashable(all objects in Java must implement a.hashcode()method)
Negative hash codes
unfortunately, -1%4=-1 will result in index errors.
Therefore, use Math.floorMod instead.
Warning
- never store objects that can change in
HashSetorHashMap! - never override
equalswithout overridehashCode!
HashCode
goal: make the table bushy
java’s HashCode() for strings: 31-based
It seemed like that 126 is better. However, the overflow is a bad problem for base 126!
major collision problem for base 126: Any string that ends in the same last 32 characters has the same hash code!
$$126^{32}=126^{33}=126^{34}=…0$$
See 61C for more.
Typical hash code: a small prime.
