In [1]:
class A(object):
    def __hash__(self):
        return id(self)
In [2]:
class B(object):
    def __hash__(self):
        return 1
In [3]:
%%timeit
{A() for a in range(10000)}
In [4]:
%%timeit
{B() for a in range(10000)}
Here's a textual version of the code above. First I create a class A whose __hash__ method returns the result of id(self), which means more or less "the value of a pointer to self". Then I create a class B whose __hash__ method returns something much simpler: the constant 1.
Putting ten thousand instances of A in a set takes 4 ms. Putting ten thousand instances of B should logically take less, because the hash is easier to compute and you don't have to store ten thousand different hashes, right? Nope. It takes longer. 317 times longer.
Note: The hash of an object is just an integer with the following property: if 2 objects are equal, then their hashes are equal too. (The converse is not always true; see the remarks at the end of the post.)
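A quick illustration of that property, and of its converse failing (a small sketch of my own, using plain tuples and ints):
x = (1, 2)
y = (1, 2)
print(x == y, hash(x) == hash(y))       # True True: equal objects, equal hashes

# The converse does not hold: unequal objects can share a hash.
# (A CPython quirk: hash(-1) is -2, because -1 is reserved internally.)
print(-1 == -2, hash(-1) == hash(-2))   # False True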
Note 2: In this post, I'm talking a lot about sets. I could say exactly the same thing about dicts. Sets are easier to manipulate in this example.
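For instance, here's a rough sketch of the same timing experiment with dicts. It assumes the A and B classes defined above and uses the instances as keys:
# Keys go through the same hash/__eq__ machinery as set elements,
# so using B instances as keys is dramatically slower than using A instances.
import timeit

print(timeit.timeit(lambda: {A(): None for _ in range(10000)}, number=1))
print(timeit.timeit(lambda: {B(): None for _ in range(10000)}, number=1))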
Let's play a bit. Introducing Verbose, a class that talks all the time.
Instances have a name, and we do 2 things:
In __eq__, we make equality between instances depend solely on their name (and talk a bit).
In __hash__, we set the hash of instances to be solely based on their name (and talk a bit).
In [5]:
class Verbose(object):
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        """
        __eq__ (for "equal") is used to determine
        the result of:
        Verbose() == other
        """
        equal = self.name == other.name
        print("Is {} equal to {} ? {}.".format(self, other, equal))
        return equal

    def __hash__(self):
        """
        Hashable objects need to implement __hash__,
        which should return an integer, with this
        property: 2 equal objects should return
        the same hash.
        """
        hash_value = hash(self.name)
        print("Asking the hash for {} which is {}.".format(self, hash_value))
        return hash_value

    def __repr__(self):
        """
        This is just so that our prints look clean
        """
        return self.name
Play time!
In [6]:
Verbose("a") == Verbose("b")
Out[6]:
In [7]:
{Verbose("a")}
Out[7]:
In [8]:
{Verbose("a"), Verbose("b")}
Out[8]:
In [9]:
{Verbose("a"), Verbose("a")}
Out[9]:
When I put an object in a set, these things happen:
We ask for the hash of the object (Python calls its __hash__ method).
We see if the hash is already in our set.
If it is, we compare the new object (with __eq__) to the objects that already have this hash.
If one of them is equal to the new object, there's nothing to do: it's already in. Otherwise, we add the new object and its hash.
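Here's a toy version of that recipe, just to make the idea concrete. This is not how CPython really does it, only a sketch of "group objects by hash, and call __eq__ only on the few objects that share the same hash":
def naive_add(buckets, obj):
    h = hash(obj)                           # 1. ask for the hash
    candidates = buckets.setdefault(h, [])  # 2. is this hash already known?
    for existing in candidates:             # 3. compare only same-hash objects
        if existing == obj:
            return                          # 4. an equal object is already there
    candidates.append(obj)                  # otherwise, remember the new object

buckets = {}
for word in ["spam", "egg", "spam"]:
    naive_add(buckets, word)
print(buckets)  # the second "spam" was not added a second time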
What would happen if 2 objects had the same hash, but were not equal?
In [10]:
class Verbose2(Verbose):
    def __hash__(self):
        return 1
In [11]:
{
    Verbose2("a"), Verbose2("b"), Verbose2("c"),
    Verbose2("d"), Verbose2("e"), Verbose2("f")
}
Out[11]:
That's 15 comparisons for 6 objects: each object gets compared to all the others. The number of comparisons is 6 * 5 / 2.
In [12]:
n = 6
In [13]:
def number_of_comparisons(n):
    return n * (n - 1) // 2

number_of_comparisons(6)
Out[13]:
So this is what is happening in our initial bad example: there are 10,000 objects that all get compared to each other.
In [14]:
number_of_comparisons(10000)
Out[14]:
That's ~50 million comparisons.
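If you want to see those comparisons happen, here's a rough experiment of my own (the Counted* classes are hypothetical, purely for illustration): count how many times __eq__ actually gets called while filling a set.
# Count the __eq__ calls made while filling a set,
# once with a good (id-based) hash and once with a constant hash.
class CountedGoodHash(object):
    eq_calls = 0
    def __eq__(self, other):
        CountedGoodHash.eq_calls += 1
        return self is other
    def __hash__(self):
        return id(self)

class CountedBadHash(object):
    eq_calls = 0
    def __eq__(self, other):
        CountedBadHash.eq_calls += 1
        return self is other
    def __hash__(self):
        return 1

{CountedGoodHash() for _ in range(1000)}
{CountedBadHash() for _ in range(1000)}
print(CountedGoodHash.eq_calls)  # 0 (or very close): distinct hashes, __eq__ is almost never needed
print(CountedBadHash.eq_calls)   # roughly 1000 * 999 / 2 = 499500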
"But you said elusively in your earliest list that 'We see if the hash is already in our set'. Isn't it the same problem ? We have to check all the values !"
And yes, if a hash could be anything, that would be true. Fortunately, we know something about hashes: they're integers. A comparison between two arbitrary objects can give 2 results: equal or not equal. A comparison between integers can give 3 results: less, equal, or greater.
What happens behind the scenes may depend on the Python implementation. Rather than telling you things Python does that we cannot test here, let's give an example of what a Python implementation might do.
Python could place all our hash values in a binary tree. A binary tree is a way to organize sortable elements that looks a lot like the game where you have to guess someone's number and they only tell you "higher" or "lower". This allows for very efficient lookups: around log2(n) comparisons to find a value among n.
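Not a tree, but the same "higher or lower" idea, in a small sketch using the standard bisect module: a binary search on a sorted list of hashes takes about log2(n) three-way comparisons instead of comparing against every element.
import bisect

stored_hashes = sorted(hash(word) for word in ["a", "b", "c", "d", "e", "f"])

def hash_already_known(stored_hashes, obj):
    h = hash(obj)
    i = bisect.bisect_left(stored_hashes, h)  # binary search: ~log2(n) steps
    return i < len(stored_hashes) and stored_hashes[i] == h

print(hash_already_known(stored_hashes, "c"))     # True
print(hash_already_known(stored_hashes, "spam"))  # almost certainly False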
In [15]:
import random
class C(object):
    def __hash__(self):
        return random.randint(0, 10)
In [16]:
c = C()
s = {c}
print(
    "Is c in the set s that contains only c ?",
    c in s
)
print(
    "And what if I ask the question 20 times ?",
    any(c in s for i in range(20))
)
With a random hash, the set usually cannot find c even though c is right there: the lookup recomputes the hash and ends up looking in the wrong place. The same problem bites mutable objects (mutable objects are objects that you can modify after they've been created): hashes are computed only when the object is first put in a set.
In [17]:
a = Verbose("a")
s = {a}
a.name = "b"
s.add(Verbose("b"))
In [18]:
s
Out[18]:
We've mutated our object a into a b after we put it in the set, and now the set contains 2 equal b objects. We've broken our set! That's why lists and dicts (which are mutable) cannot be put in sets, but tuples and frozensets (which are immutable) can.
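A quick check of that rule (a small sketch of my own):
# lists and dicts are unhashable, tuples and frozensets are hashable:
try:
    {[1, 2, 3]}
except TypeError as error:
    print(error)                        # unhashable type: 'list'

print({(1, 2, 3), frozenset({4, 5})})   # works fine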
Given that hash values are integers with a lower and an upper bound, there are far fewer possible hashes than possible distinct objects: there are not enough integers to give a unique one to every possible object in the universe. Here are a few examples:
In [19]:
hash(1.000000000000001) == hash(2561)
Out[19]:
In [20]:
hash(object) == hash(hash(object))
Out[20]:
So, would you use Python if you knew that, sometimes, a random object could disappear from a set or a dict just because it has the same hash as an existing object? You'd probably run away, and I would too. But it's OK, Python has our back. \o/
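Here's a small check of that (Collider is a made-up class, just for the demonstration): two unequal objects that share a hash both stay in the set, because Python falls back on __eq__ to tell them apart.
class Collider(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                        # every instance collides, on purpose
    def __eq__(self, other):
        return self.name == other.name

s = {Collider("a"), Collider("b")}
print(len(s))                            # 2: nothing disappeared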
Well, it can be. I didn't know about this until @rami showed me, but yes, hash collisions can be quite a security issue: they can help an attacker bring your server to its knees. See the video and the discussion on the subject.
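As a side note (a CPython fact, not something from the talk itself): since Python 3.3, str and bytes hashes are randomized per process precisely to make this kind of attack harder. A quick way to see it:
# Run this as a script twice (with PYTHONHASHSEED unset): the printed value
# will usually differ between runs, because CPython randomizes str/bytes
# hashes per process to mitigate hash-flooding attacks.
print(hash("hello"))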
That's all, folks. Thanks a lot to @rixx for the inspiration of writing this as a blog post.
I'm all ears for suggestions and improvements regarding both the content and the format of this. I might do it again later.
See ya!
Joachim Jablon
(Creative Commons : BY-NC)