In [1]:
class A(object):
    def __hash__(self):
        return id(self)
In [2]:
class B(object):
    def __hash__(self):
        return 1
In [3]:
%%timeit
{A() for a in range(10000)}
In [4]:
%%timeit
{B() for a in range(10000)}
Here's a textual version of the code above. First I create a class A whose __hash__ method returns the result of id(self), which means more or less "the value of a pointer to self". Then I create a class B whose __hash__ method returns something much simpler: the constant 1.
Putting ten thousand instances of A in a set takes 4 ms. Putting ten thousand instances of B should logically take less, because the hash is easier to compute and you don't have to store ten thousand different hashes, right? Nope. It takes longer. 317 times longer.
Note: The hash of an object is just an integer with the following property: if 2 objects are equal, then their hashes are equal too. (The converse is not always true; see the remarks at the end of the post.)
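A quick illustration of that property, and of its converse failing (a small sketch of my own, using plain tuples and ints):
x = (1, 2)
y = (1, 2)
print(x == y, hash(x) == hash(y))       # True True: equal objects, equal hashes

# The converse does not hold: unequal objects can share a hash.
# (A CPython quirk: hash(-1) is -2, because -1 is reserved internally.)
print(-1 == -2, hash(-1) == hash(-2))   # False True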
Note 2: In this post, I'm talking a lot about sets. I could say exactly the same thing about dicts. Sets are easier to manipulate in this example.
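For instance, here's a rough sketch of the same timing experiment with dicts. It assumes the A and B classes defined above and uses the instances as keys:
# Keys go through the same hash/__eq__ machinery as set elements,
# so using B instances as keys is dramatically slower than using A instances.
import timeit

print(timeit.timeit(lambda: {A(): None for _ in range(10000)}, number=1))
print(timeit.timeit(lambda: {B(): None for _ in range(10000)}, number=1))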
Let's play a bit. Introducing Verbose, a class that talks all the time.
Instances have a name, and we do 2 things:
In __eq__, we make equality between instances depend solely on their name (and talk a bit).
In __hash__, we set the hash of instances to be solely based on their name (and talk a bit).
In [5]:
class Verbose(object):
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        """
        __eq__ (for "equal") is used to determine
        the result of:
        Verbose() == other
        """
        equal = self.name == other.name
        print("Is {} equal to {} ? {}.".format(self, other, equal))
        return equal

    def __hash__(self):
        """
        Hashable objects need to implement __hash__,
        which should return an integer, with this
        property: 2 equal objects should return
        the same hash.
        """
        hash_value = hash(self.name)
        print("Asking the hash for {} which is {}.".format(self, hash_value))
        return hash_value

    def __repr__(self):
        """
        This is just so that our prints look clean
        """
        return self.name
Play time!
In [6]:
Verbose("a") == Verbose("b")
Out[6]:
In [7]:
{Verbose("a")}
Out[7]:
In [8]:
{Verbose("a"), Verbose("b")}
Out[8]:
In [9]:
{Verbose("a"), Verbose("a")}
Out[9]:
When I put an object in a set, these things happen:
We ask for the hash of the object (Python calls its __hash__ method).
We see if the hash is already in our set.
If it is, we compare the new object (with __eq__) to the objects that already have this hash.
If one of them is equal to the new object, there's nothing to do: it's already in. Otherwise, we add the new object and its hash.
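Here's a toy version of that recipe, just to make the idea concrete. This is not how CPython really does it, only a sketch of "group objects by hash, and call __eq__ only on the few objects that share the same hash":
def naive_add(buckets, obj):
    h = hash(obj)                           # 1. ask for the hash
    candidates = buckets.setdefault(h, [])  # 2. is this hash already known?
    for existing in candidates:             # 3. compare only same-hash objects
        if existing == obj:
            return                          # 4. an equal object is already there
    candidates.append(obj)                  # otherwise, remember the new object

buckets = {}
for word in ["spam", "egg", "spam"]:
    naive_add(buckets, word)
print(buckets)  # the second "spam" was not added a second time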
What would happen if 2 objects had the same hash, but were not equal?
In [10]:
class Verbose2(Verbose):
    def __hash__(self):
        return 1
In [11]:
{
    Verbose2("a"), Verbose2("b"), Verbose2("c"),
    Verbose2("d"), Verbose2("e"), Verbose2("f")
}
Out[11]:
That's 15 comparisons for 6 objects: each object gets compared to all the others. The number of comparisons is 6 * 5 / 2.
In [12]:
n = 6
In [13]:
def number_of_comparisons(n):
    return n * (n - 1) // 2

number_of_comparisons(6)
Out[13]:
So this is what is happening in our initial bad example: there are 10,000 objects that all get compared to each other.
In [14]:
number_of_comparisons(10000)
Out[14]:
That's ~50 million comparisons.
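If you want to see those comparisons happen, here's a rough experiment of my own (the Counted* classes are hypothetical, purely for illustration): count how many times __eq__ actually gets called while filling a set.
# Count the __eq__ calls made while filling a set,
# once with a good (id-based) hash and once with a constant hash.
class CountedGoodHash(object):
    eq_calls = 0
    def __eq__(self, other):
        CountedGoodHash.eq_calls += 1
        return self is other
    def __hash__(self):
        return id(self)

class CountedBadHash(object):
    eq_calls = 0
    def __eq__(self, other):
        CountedBadHash.eq_calls += 1
        return self is other
    def __hash__(self):
        return 1

{CountedGoodHash() for _ in range(1000)}
{CountedBadHash() for _ in range(1000)}
print(CountedGoodHash.eq_calls)  # 0 (or very close): distinct hashes, __eq__ is almost never needed
print(CountedBadHash.eq_calls)   # roughly 1000 * 999 / 2 = 499500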
"But you said elusively in your earliest list that 'We see if the hash is already in our set'. Isn't it the same problem ? We have to check all the values !"
And yes, if a hash could be anything, that would be true. Fortunately, we know something about hashes: they're integers. A comparison between two arbitrary objects can give 2 results: equal or not equal. A comparison between integers can give 3 results: less, equal, or greater.
What happens behind the scenes may depend on the Python implementation. Rather than telling you things Python does that we cannot test here, let's give an example of what a Python implementation might do.
Python could place all our hash values in a binary tree. A binary tree is a way to organize sortable elements that looks a lot like the game where you have to guess someone's number and they only tell you "higher" or "lower". This allows for very efficient lookups: around log2(n) comparisons to find a value among n.
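Not a tree, but the same "higher or lower" idea, in a small sketch using the standard bisect module: a binary search on a sorted list of hashes takes about log2(n) three-way comparisons instead of comparing against every element.
import bisect

stored_hashes = sorted(hash(word) for word in ["a", "b", "c", "d", "e", "f"])

def hash_already_known(stored_hashes, obj):
    h = hash(obj)
    i = bisect.bisect_left(stored_hashes, h)  # binary search: ~log2(n) steps
    return i < len(stored_hashes) and stored_hashes[i] == h

print(hash_already_known(stored_hashes, "c"))     # True
print(hash_already_known(stored_hashes, "spam"))  # almost certainly False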
In [15]:
import random
class C(object):
    def __hash__(self):
        return random.randint(0, 10)
In [16]:
c = C()
s = {c}
print(
    "Is c in the set s that contains only c ?",
    c in s
)
print(
    "And what if I ask the question 20 times ?",
    any(c in s for i in range(20))
)
With a random hash, the set usually cannot find c even though c is right there: the lookup recomputes the hash and ends up looking in the wrong place. The same problem bites mutable objects (mutable objects are objects that you can modify after they've been created): hashes are computed only when the object is first put in a set.
In [17]:
a = Verbose("a")
s = {a}
a.name = "b"
s.add(Verbose("b"))
In [18]:
s
Out[18]:
We've mutated our object a into a b after we put it in the set, and now the set contains 2 equal b objects. We've broken our set! That's why lists and dicts (which are mutable) cannot be put in sets, but tuples and frozensets (which are immutable) can.
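A quick check of that rule (a small sketch of my own):
# lists and dicts are unhashable, tuples and frozensets are hashable:
try:
    {[1, 2, 3]}
except TypeError as error:
    print(error)                        # unhashable type: 'list'

print({(1, 2, 3), frozenset({4, 5})})   # works fine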
Given that hash values are integers with a lower and an upper bound, there are far fewer possible hashes than possible distinct objects: there are not enough integers to give a unique one to every possible object in the universe. Here are a few examples:
In [19]:
hash(1.000000000000001) == hash(2561)
Out[19]:
In [20]:
hash(object) == hash(hash(object))
Out[20]:
So, would you use Python if you knew that, sometimes, a random object could disappear from a set or a dict just because it has the same hash as an existing object? You'd probably run away, and I would too. But it's OK, Python has our back. \o/
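Here's a small check of that (Collider is a made-up class, just for the demonstration): two unequal objects that share a hash both stay in the set, because Python falls back on __eq__ to tell them apart.
class Collider(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                        # every instance collides, on purpose
    def __eq__(self, other):
        return self.name == other.name

s = {Collider("a"), Collider("b")}
print(len(s))                            # 2: nothing disappeared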
Well, it can be. I didn't know about this until @rami showed me, but yes, hash collisions can be quite a security issue: they can help an attacker bring your server to its knees. See the video and the discussion on the subject.
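As a side note (a CPython fact, not something from the talk itself): since Python 3.3, str and bytes hashes are randomized per process precisely to make this kind of attack harder. A quick way to see it:
# Run this as a script twice (with PYTHONHASHSEED unset): the printed value
# will usually differ between runs, because CPython randomizes str/bytes
# hashes per process to mitigate hash-flooding attacks.
print(hash("hello"))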
That's all, folks. Thanks a lot to @rixx for the inspiration of writing this as a blog post.
I'm all ears for suggestions and improvements regarding both the content and the format of this. I might do it again later.
See ya!
Joachim Jablon
(Creative Commons : BY-NC)