Of Hashes and Pythons
Joachim Jablon
I posted the following code in a tweet today and subsequently had a very interesting discussion answering questions about it, which I'll summarize here. There might be some new material, too.
So what's happening here?
Here's a textual version of the code from the tweet.
First, I create a class A whose __hash__ method returns the result of id(self), which means, more or less, "the value of a pointer to self".
Then I create a class B whose __hash__ method returns something much simpler: the constant 1.
Putting ten thousand instances of A in a set takes 4 ms. Putting ten thousand instances of B should logically take less, because the hash is simpler to compute and you don't have to store ten thousand different hashes, right? Nope. It takes longer. 317 times longer.
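The snippet from the tweet isn't reproduced here, but the experiment looked roughly like this (the timing helper is mine, and the exact numbers will vary from machine to machine):

```python
import time

class A:
    def __hash__(self):
        # Every instance hashes to its own memory address: no collisions.
        return id(self)

class B:
    def __hash__(self):
        # Every instance hashes to the same constant: nothing but collisions.
        return 1

def time_set_creation(cls, n=10_000):
    instances = [cls() for _ in range(n)]
    start = time.perf_counter()
    set(instances)
    return time.perf_counter() - start

print(f"A: {time_set_creation(A):.4f} s")
print(f"B: {time_set_creation(B):.4f} s")
```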
Note: The hash of an object is just an integer with the following property: if 2 objects are equal, then their hashes are equal too. (The converse is not always true; see the remarks at the end of the post.)
Note 2: In this post, I’m talking a lot about sets. I could say exactly the same thing about dicts. Sets are easier to manipulate in this example.
What happens when I put something in a set?
Let’s play a bit. Introducing Verbose, a class that talks all the time.
Instances have a name, and we do 2 things (a sketch follows below):
- If 2 instances share the same name, we say they're equal, through __eq__ (and talk a bit).
- Through __hash__, we set the hash of instances to be based solely on their name (and talk a bit).
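The original snippet isn't reproduced here, but a Verbose matching that description could look like this (the exact messages are made up):

```python
class Verbose:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Talk a bit, then compare on the name only.
        print(f"{self.name}: am I equal to {other.name}?")
        return self.name == other.name

    def __hash__(self):
        # Talk a bit, then hash on the name only.
        print(f"{self.name}: computing my hash")
        return hash(self.name)
```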
Play time!
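With the sketch above, a session along these lines shows who gets called, and when:

```python
>>> s = set()
>>> s.add(Verbose("a"))
a: computing my hash
>>> s.add(Verbose("a"))   # a second, distinct object that compares equal
a: computing my hash
a: am I equal to a?
>>> len(s)
1
```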
When I put an object in a set, these things happen:
- We compute the hash of the object.
- We check whether that hash is already in our set.
- If so, we get the objects that share the same hash; they're potential matches.
- We check whether our object is equal to any object in that list.
- If so, it's already in our set, so there's no need to add it again.
- If not, we add the object (and its hash) to the set.
What would happen if 2 objects had the same hash, but were not equal?
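The original snippet isn't shown here either; here's a sketch in the same spirit, using a hypothetical Collider class that counts its own __eq__ calls:

```python
class Collider:
    """Same hash for everybody, but equality is still based on the name."""
    comparisons = 0

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 1

    def __eq__(self, other):
        Collider.comparisons += 1
        return self.name == other.name

s = {Collider(name) for name in "abcdef"}
print(Collider.comparisons)  # expect 15 on CPython: 0 + 1 + 2 + 3 + 4 + 5
```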
That's 15 comparisons for 6 objects. Each object gets compared to all the others. The number of comparisons is 6 * 5 / 2 = 15.
So this is what is happening in our initial bad example: there are 10 000 objects that all get compared to each other. That's 10 000 * 9 999 / 2, roughly 50 million comparisons.
At this point, you might ask:
"But you glossed over a step in your earlier list: 'We check whether that hash is already in our set'. Isn't that the same problem? Don't we have to check all the values?"
And yes, if a hash could be anything, that would be true. Fortunately, we know something about hashes: they're integers. A comparison between two arbitrary objects can give 2 results: equal or not equal. A comparison between two integers can give 3 results: less, equal, or greater.
What happens behind the scenes may depend on the Python implementation. Rather than asserting things about Python internals that we can't test here, let's look at an example of what a Python implementation might do.
Python could place all our values in a binary tree. A binary tree is a way to organize sortable elements that looks a lot like the game where you have to guess someone's number when they only tell you "higher" or "lower". This allows for very efficient lookups: around log2(n) comparisons to find a value among n.
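CPython actually uses a hash table rather than a tree (more on that at the end of the post), but here's a minimal sketch of the guessing-game idea, using a sorted list of hashes and a binary search (the helper names are mine):

```python
import bisect

# A toy "index": keep the hashes sorted, so checking whether a hash is present
# is a binary search -- at most ~log2(n) "higher or lower" questions.
sorted_hashes = sorted(hash(value) for value in range(10_000))

def hash_is_present(h):
    i = bisect.bisect_left(sorted_hashes, h)
    return i < len(sorted_hashes) and sorted_hashes[i] == h

print(hash_is_present(hash(42)))       # True
print(hash_is_present(hash(123_456)))  # False
```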
So that’s it, then
Now we know why the A class is fast and the B class is so very, very slow.
Thank you very much. The show's finished. What? You want more? OK, let's get you more!
What happens when the hash is not really reliable?
Why do they say mutable objects should not be hashable?
(Mutable objects are objects that you can modify after they've been created.) The hash of an object is computed when it's put in a set, and the stored hash is never recomputed afterwards.
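Here's a sketch of that breakage, reusing the Verbose class sketched above (its chatty messages will show up in between; they're left out of the comments):

```python
a = Verbose("a")
s = {a}            # a is stored under hash("a")
a.name = "b"       # mutated *after* insertion: the hash stored in the set is now stale
b = Verbose("b")
print(a == b)      # True: same name, so they compare equal
s.add(b)           # b is stored under hash("b"), so the set never notices a
print(len(s))      # 2: the set now holds two objects that compare equal
```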
We've mutated our object a into a b after we put it in a set, and now the set contains 2 equal b objects. We've broken our set! So that's why lists and dicts (which are mutable) cannot be put in sets, but tuples and frozensets (which are immutable) can.
Do we really need hash AND eq to compare objects? Isn't hash alone enough?
Given that there's a lower and an upper limit to the integer values a hash can take, there are far fewer possible hashes than possible objects: there are not enough integers to give a unique one to every possible object in the universe. Here are a few examples:
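The original examples aren't reproduced here, but here are a couple of collisions you can check for yourself (the second one relies on how CPython hashes integers):

```python
>>> hash(-1) == hash(-2)        # -1 is reserved as an error marker inside CPython
True
>>> import sys
>>> M = sys.hash_info.modulus   # the prime CPython uses to reduce integer hashes
>>> hash(1) == hash(1 + M)      # two very different integers, same hash
True
>>> (-1 == -2, 1 == 1 + M)      # ...and yet none of these objects are equal
(False, False)
```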
So, would you use Python if you knew that, sometimes, a random object could disappear from a set or a dict just because it has the same hash as an existing object? You'd probably run away, and so would I. But it's OK, Python has our back. \o/
[Update:] Is this a security issue?
Well, it can be. I didn't know about this until @rami showed me, but yes, this can be quite a security issue, because it can help an attacker bring your server to its knees. See the video and the discussion on the subject.
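To get a feel for it, here's a sketch of the kind of slowdown an attacker is after, reusing the integer-hash modulus from above to craft keys that all collide (this relies on CPython's integer hashing, and the exact timings are machine-dependent):

```python
import sys
import time

M = sys.hash_info.modulus

def insertion_time(keys):
    start = time.perf_counter()
    d = {}
    for key in keys:
        d[key] = None
    return time.perf_counter() - start

n = 10_000
normal_keys = list(range(n))               # distinct hashes: fast, roughly linear
hostile_keys = [i * M for i in range(n)]   # hash(i * M) == 0 for every key: quadratic

print(f"normal : {insertion_time(normal_keys):.3f} s")
print(f"hostile: {insertion_time(hostile_keys):.3f} s")
```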
[Another Update:] More details on dict implementations in CPython (and PyPy)
For more details on the exact implementation of dictionaries (and a few words on sets, which don't work quite the same way, as I learned), I strongly recommend @raymondh's talk on the matter.
Conclusion
That's all, folks. Thanks a lot to @rixx for inspiring me to write this up as a blog post.
I’m all ears for suggestions and improvements regarding both the content and the format of this. I might do it again later.
See ya!
Joachim Jablon
(Creative Commons: BY-NC)