Python is an interpreted programming language, which means you do not need to compile your program before running it, as you would with Java or C++. You can write your Python program in any text editor, such as Emacs, Vim, or Sublime Text. Save your Python program (script) with the .py extension. You can then run it by calling

$ python myprogram.py

Here, $ denotes the command prompt and myprogram.py is the name of the program.
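
As a minimal sketch, myprogram.py could contain something as simple as the following. The file contents and the message are made up for illustration; any valid Python code can go in the script.

```python
# myprogram.py: hypothetical contents, for illustration only.
# Running "python myprogram.py" executes these statements from top to bottom.
message = "Hello from myprogram.py"
print(message)
```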

We will use IPython notebooks to demonstrate various features of Python in this module. You can always write these commands in a script and run it as described above, but you should also familiarize yourself with IPython notebooks because they let you quickly test Python code snippets without having to write entire Python scripts.

Measuring Jaccard Similarity between two given Sentences


In [1]:
A = "I love data mining"
B = "I hate data mining"

First we will convert all characters in the two sentences to lowercase using the lower() method of strings. We will then split the sentences into words (tokens) using the split() method.


In [2]:
wordsA = A.lower().split()
wordsB = B.lower().split()

wordsA and wordsB are lists of words as shown below.


In [3]:
print(wordsA)


['i', 'love', 'data', 'mining']

In [4]:
print(wordsB)


['i', 'hate', 'data', 'mining']
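
The short sketch below suggests why both steps matter. The example string is made up for illustration: without lower(), "Data" and "data" would count as different tokens, and split() called with no arguments splits on any run of whitespace.

```python
# Hypothetical sentence with mixed case and irregular spacing.
C = "I  LOVE Data\tMining"

# Without lowercasing, the tokens keep their original case.
print(C.split())          # ['I', 'LOVE', 'Data', 'Mining']

# After lowercasing, the tokens match the lowercase form used above.
wordsC = C.lower().split()
print(wordsC)             # ['i', 'love', 'data', 'mining']
```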

Let's convert the two lists into sets so that we can compute the Jaccard coefficient between the two sets.


In [6]:
sA = set(wordsA)
sB = set(wordsB)
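
One consequence worth noting is that a set keeps each element only once, so a word repeated within a sentence is counted a single time. A small sketch, using a made-up token list:

```python
# Hypothetical token list in which "data" appears twice.
words = ["data", "mining", "data"]
unique_words = set(words)

print(len(words))          # 3
print(len(unique_words))   # 2
```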

The Jaccard coefficient between two sets X and Y is defined as follows:

Jaccard(X,Y) = |X AND Y| / |X OR Y|


In [7]:
Jaccard = float(len(sA.intersection(sB))) / float(len(sA.union(sB)))

In [8]:
print(Jaccard)


0.6
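
To see where 0.6 comes from, it can help to look at the two sets directly. The sketch below rebuilds the sets so that it runs on its own; sorted() is used only to give a stable printing order, since sets are unordered.

```python
sA = set("i love data mining".split())
sB = set("i hate data mining".split())

common = sA.intersection(sB)   # words that appear in both sentences
combined = sA.union(sB)        # words that appear in either sentence

print(sorted(common))          # ['data', 'i', 'mining']
print(sorted(combined))        # ['data', 'hate', 'i', 'love', 'mining']
```

The two sentences share 3 words out of 5 distinct words in total, so the coefficient is 3 / 5 = 0.6.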

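The intersection() and union() methods of the set class compute A AND B and A OR B, respectively, and the built-in len() function returns the number of elements in a set. In Python 2, dividing one integer by another with / performs integer division, which would truncate the result to 0 here, so we cast the lengths to floats to get floating-point division. (In Python 3, / between integers already returns a float, so the casts are harmless but unnecessary.)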
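
The steps in this section can be collected into one helper function. This is a sketch rather than a definitive implementation: the name jaccard_similarity and the choice to return 0.0 when both sentences are empty are assumptions made here, not part of the walkthrough above.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Jaccard coefficient between the word sets of two sentences."""
    set_a = set(sentence_a.lower().split())
    set_b = set(sentence_b.lower().split())
    union = set_a.union(set_b)
    if not union:
        # Assumption: treat two empty sentences as having similarity 0.0
        # to avoid dividing by zero.
        return 0.0
    # float() keeps the division floating-point under Python 2 as well.
    return float(len(set_a.intersection(set_b))) / float(len(union))


print(jaccard_similarity("I love data mining", "I hate data mining"))  # 0.6
```

Wrapping the steps in a function makes it easy to compare other sentence pairs without repeating the lowercasing, splitting, and set conversion each time.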

