Topic Modeling with Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a method for reducing the dimensionality of documents treated as bags of words. It is used for document classification, clustering and retrieval. For example, LSA can be used to search for prior art given a new patent application. In this homework, we will implement a small library for simple latent semantic analysis as a practical example of the application of SVD. The ideas are very similar to PCA.

We will implement a toy example of LSA to get familiar with the ideas. If you want to use LSA or similar methods for statistical language analysis, the most efficient Python library is probably gensim, which also provides an online algorithm, i.e. the training information can be continuously updated. Other useful functions for processing natural language can be found in the Natural Language Toolkit (NLTK).

Note: The SVD from scipy.linalg performs a full decomposition, which is inefficient since we only need the first k singular values. If the SVD from scipy.linalg is too slow, please use the sparsesvd function from the sparsesvd package to perform SVD instead. You can install it in the usual way with

!pip install sparsesvd

Then import the following

from sparsesvd import sparsesvd 
from scipy.sparse import csc_matrix

and use as follows

sparsesvd(csc_matrix(M), k=10)
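
If sparsesvd is not available, SciPy's own scipy.sparse.linalg.svds also computes only k singular triplets. The sketch below is an alternative to the call above, not a drop-in replacement: svds returns the left factor with shape (m, k), whereas sparsesvd returns it transposed, and the example sorts the singular values explicitly rather than relying on the order in which they are returned. The random matrix A is a made-up stand-in for M.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.random((30, 12))          # hypothetical dense matrix standing in for M
k = 3

# Truncated SVD: only k singular triplets are computed.
U_k, s_k, Vt_k = svds(csc_matrix(A), k=k)

# Sort the triplets so the largest singular value comes first.
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

# Compare with the leading singular values of the full decomposition.
s_full = np.linalg.svd(A, compute_uv=False)
print(U_k.shape, s_k.shape, Vt_k.shape)
print(np.allclose(s_k, s_full[:k]))
```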

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.linalg as la
import scipy.stats as st

Exercise 1 (10 points). Calculating pairwise distance matrices.

Suppose we want to construct a distance matrix between the rows of a matrix. For example, given the matrix

M = np.array([[1,2,3],[4,5,6]])

the distance matrix using Euclidean distance as the measure would be

[[ 0.000  1.414  2.828]
 [ 1.414  0.000  1.414]
 [ 2.828  1.414  0.000]]

if $M$ was a collection of column vectors.
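
To see where these numbers come from, note that the columns of $M$ are (1, 4), (2, 5) and (3, 6). Adjacent columns differ by (1, 1), giving a distance of $\sqrt{2} \approx 1.414$, and the first and last columns differ by (2, 2), giving $2\sqrt{2} \approx 2.828$. A minimal check with explicit loops (the name `D` is only illustrative):

```python
import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6]])
ncols = M.shape[1]

# D[i, j] is the Euclidean distance between column i and column j of M.
D = np.zeros((ncols, ncols))
for i in range(ncols):
    for j in range(ncols):
        diff = M[:, i] - M[:, j]
        D[i, j] = np.sqrt((diff ** 2).sum())

print(np.round(D, 3))
```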

Write a function to calculate the pairwise-distance matrix given the matrix $M$ and some arbitrary distance function. Your functions should have the following signature:

def func_name(M, distance_func):
    pass
  1. Write a distance function for the Euclidean, squared Euclidean and cosine measures.
  2. Write the function using looping for M as a collection of row vectors.
  3. Write the function using looping for M as a collection of column vectors.
  4. Write the function using broadcasting for M as a collection of row vectors.
  5. Write the function using broadcasting for M as a collection of column vectors.

For 3 and 5 (the column-vector versions), try to avoid using transposition (but if you get stuck, there will be no penalty for using transposition). Check that the looping and broadcasting versions give the same result when applied to the given matrix $M$.


In [2]:
# Questions 1.1 to 1.5

def squared_euclidean_norm(u, axis=-1):
    return (u**2).sum(axis)

def euclidean_norm(u, axis=-1):
    return np.sqrt(squared_euclidean_norm(u, axis))

def squared_euclidean_dist(u, v, axis=-1):
    """Returns squared Euclidean distance between two vectors."""
    return squared_euclidean_norm(u-v, axis)

def euclidean_dist(u, v, axis=-1):
    """Returns Euclidean distance between two vectors."""
    return np.sqrt(squared_euclidean_dist(u, v, axis))
    
def cosine_dist(u, v, axis=-1):
    """Returns cosine distance (1 - cosine of the angle) between two vectors."""
    # return 1 - np.dot(u, v)/(la.norm(u)*la.norm(v))
    return 1 - (u * v).sum(axis)/(euclidean_norm(u, axis) * euclidean_norm(v, axis))

def loop_row_pdist(M, f):
    """Returns pairwise-distance matrix assuming M consists of row vectors."""
    nrows, ncols = M.shape
    return np.array([[f(M[u,:], M[v,:]) for u in range(nrows)] 
                                        for v in range(nrows)])

def loop_col_pdist(M, f):
    """Returns pairwise-distance matrix assuming M consists of column vectors."""
    nrows, ncols = M.shape
    return np.array([[f(M[:,u], M[:,v]) for u in range(ncols)] 
                                        for v in range(ncols)])

def broadcast_row_pdist(M, f):
    """Returns pairwise-distance matrix assuming M consists of row vectors."""
    # Shapes (1, n, d) and (n, 1, d) broadcast to (n, n, d); f reduces the last axis
    return f(M[None,:,:], M[:,None,:])

def broadcast_col_pdist(M, f):
    """Returns pairwise-distance matrix assuming M consists of column vectors."""
    # Shapes (d, 1, n) and (d, n, 1) broadcast to (d, n, n); f reduces the first axis
    return f(M[:,None,:], M[:,:,None], axis=0)
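
The broadcasting versions rely on inserting a length-1 axis so that NumPy pairs every vector with every other vector. The sketch below traces the intermediate shapes for the row-vector case, with the Euclidean distance written out inline. It is meant only to illustrate the mechanism, and the variable names are illustrative.

```python
import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6]])    # 2 row vectors of length 3

a = M[None, :, :]     # shape (1, 2, 3)
b = M[:, None, :]     # shape (2, 1, 3)
diff = a - b          # broadcasts to (2, 2, 3): one difference vector per pair of rows
D = np.sqrt((diff ** 2).sum(axis=-1))   # reduce over the vector axis -> (2, 2)

print(a.shape, b.shape, diff.shape, D.shape)
print(D)
```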

In [4]:
# Q1 checking results

M = np.array([[1,2,3],[4,5,6]])

# dist = euclidean_dist
for dist in (cosine_dist, euclidean_dist, squared_euclidean_dist):
    print(loop_row_pdist(M, dist), '\n')
    print(broadcast_row_pdist(M, dist), '\n')
    print(loop_col_pdist(M, dist), '\n')
    print(broadcast_col_pdist(M, dist))


[[  0.00000000e+00   2.53681538e-02]
 [  2.53681538e-02   2.22044605e-16]] 

[[  0.00000000e+00   2.53681538e-02]
 [  2.53681538e-02   2.22044605e-16]] 

[[  0.00000000e+00   9.16983196e-03   2.38129398e-02]
 [  9.16983196e-03  -2.22044605e-16   3.45424176e-03]
 [  2.38129398e-02   3.45424176e-03   1.11022302e-16]] 

[[  0.00000000e+00   9.16983196e-03   2.38129398e-02]
 [  9.16983196e-03  -2.22044605e-16   3.45424176e-03]
 [  2.38129398e-02   3.45424176e-03   1.11022302e-16]]
[[ 0.          5.19615242]
 [ 5.19615242  0.        ]] 

[[ 0.          5.19615242]
 [ 5.19615242  0.        ]] 

[[ 0.          1.41421356  2.82842712]
 [ 1.41421356  0.          1.41421356]
 [ 2.82842712  1.41421356  0.        ]] 

[[ 0.          1.41421356  2.82842712]
 [ 1.41421356  0.          1.41421356]
 [ 2.82842712  1.41421356  0.        ]]
[[ 0 27]
 [27  0]] 

[[ 0 27]
 [27  0]] 

[[0 2 8]
 [2 0 2]
 [8 2 0]] 

[[0 2 8]
 [2 0 2]
 [8 2 0]]

Exercise 2 (20 points). Write 3 functions to calculate the term frequency (tf), the inverse document frequency (idf) and the product (tf-idf). Each function should take a single argument docs, which is a dictionary of (key=identifier, value=document text) pairs, and return an appropriately sized array. Convert '-' to ' ' (space), remove punctuation, convert text to lowercase and split on whitespace to generate a collection of terms from the document text.

  • tf = the number of occurrences of term $i$ in document $j$
  • idf = $\log \frac{n}{1 + \text{df}_i}$ where $n$ is the total number of documents and $\text{df}_i$ is the number of documents in which term $i$ occurs.

Print the table of tf-idf values for the following document collection

s1 = "The quick brown fox"
s2 = "Brown fox jumps over the jumps jumps jumps"
s3 = "The the the lazy dog elephant."
s4 = "The the the the the dog peacock lion tiger elephant"

docs = {'s1': s1, 's2': s2, 's3': s3, 's4': s4}
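
As a sanity check on the definitions above, a few tf-idf entries for this collection can be worked out by hand using only the standard library. This is a sketch, not the requested solution: the helper names `tokenize` and `idf_of` are illustrative, and the functions work term by term rather than returning arrays.

```python
import math
import string
from collections import Counter

docs = {
    's1': "The quick brown fox",
    's2': "Brown fox jumps over the jumps jumps jumps",
    's3': "The the the lazy dog elephant.",
    's4': "The the the the the dog peacock lion tiger elephant",
}

def tokenize(text):
    """Lowercase, turn '-' into a space, drop punctuation, split on whitespace."""
    text = text.lower().replace('-', ' ')
    return text.translate(str.maketrans('', '', string.punctuation)).split()

# Term counts per document (tf)
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
n = len(docs)

def idf_of(term):
    """log(n / (1 + df)), with df the number of documents containing the term."""
    df = sum(1 for c in counts.values() if term in c)
    return math.log(n / (1 + df))

for term, name in [('brown', 's1'), ('jumps', 's2'), ('the', 's4')]:
    tf = counts[name][term]
    print(term, name, tf, round(tf * idf_of(term), 6))
```

Note that "the" occurs in all 4 documents, so $n / (1 + \text{df}) = 4/5 < 1$ and its idf is negative under this definition.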

In [11]:
# The tf() function is optional - it can also be coded directly into tfs()

# Question 2.1
def tf(doc):
    """Returns the number of times each term occurs in a document.
    We preprocess the document to strip punctuation and convert to lowercase.
    Terms are found by splitting on whitespace."""
    from collections import Counter
    from string import punctuation

    table = dict.fromkeys(map(ord, punctuation))
    terms = doc.lower().replace('-', ' ').translate(table).split()
    return Counter(terms)

def tfs(docs):
    """Create a term frequency dataframe from a dictionary of documents."""
    # Rows are terms, columns are documents; absent terms get a count of 0
    df = pd.DataFrame({k: tf(v) for k, v in docs.items()}).fillna(0)
    return df

# Question 2.2
def idf(docs):
    """Find inverse document frequency series from a dictionary of documents."""
    term_freq = tfs(docs)
    num_docs = len(docs)
    doc_freq = (term_freq > 0).sum(axis=1)
    return np.log(num_docs/(1 + doc_freq))

# Question 2.3
def tf_idf(docs):
    """Return the product of the term frequency and inverse document frequency."""
    return tfs(docs).mul(idf(docs), axis=0)

In [12]:
# Question 2.4

s1 = "The quick brown fox"
s2 = "Brown fox jumps over the jumps jumps jumps"
s3 = "The the the lazy dog elephant."
s4 = "The the the the the dog peacock lion tiger elephant"

docs = {'s1': s1, 's2': s2, 's3': s3, 's4': s4}

tf_idf(docs)


Out[12]:
                 s1         s2         s3         s4
brown      0.287682   0.287682   0.000000   0.000000
dog        0.000000   0.000000   0.287682   0.287682
elephant   0.000000   0.000000   0.287682   0.287682
fox        0.287682   0.287682   0.000000   0.000000
jumps      0.000000   2.772589   0.000000   0.000000
lazy       0.000000   0.000000   0.693147   0.000000
lion       0.000000   0.000000   0.000000   0.693147
over       0.000000   0.693147   0.000000   0.000000
peacock    0.000000   0.000000   0.000000   0.693147
quick      0.693147   0.000000   0.000000   0.000000
the       -0.223144  -0.223144  -0.669431  -1.115718
tiger      0.000000   0.000000   0.000000   0.693147

Exercise 3 (20 points).

  1. Write a function that takes a matrix $M$ and an integer $k$ as arguments, and reconstructs a reduced matrix using only the $k$ largest singular values. Use the scipy.linalg.svd function to perform the decomposition. This is the best rank-$k$ approximation to the matrix $M$ in the least-squares sense.

  2. Apply the function you just wrote to the following term-frequency matrix for a set of $9$ documents using $k=2$ and print the reconstructed matrix $M'$.

    M = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0, 0, 0, 0],
        [0, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 1, 1]])
  3. Calculate the pairwise correlation matrix for the original matrix $M$ and the reconstructed matrix using $k=2$ singular values (you may use scipy.stats.spearmanr to do the calculations). Consider the first 5 documents as one group $G1$ and the last 4 as another group $G2$ (i.e. first 5 and last 4 columns). What is the average within-group correlation for $G1$ and $G2$, and the average cross-group correlation for $G1$-$G2$, using either $M$ or $M'$? (Do not include self-correlation in the within-group calculations.)
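
The least-squares property mentioned in item 1 can be checked numerically: for the truncated reconstruction, the Frobenius norm of the residual equals the square root of the sum of the squared discarded singular values. The following is a sketch using NumPy's SVD on a small random matrix; the matrix `A` and the helper name `rank_k_approx` are hypothetical and not part of the exercise.

```python
import numpy as np

def rank_k_approx(A, k):
    """Reconstruct A from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.random((6, 4))
k = 2
A_k = rank_k_approx(A, k)

s = np.linalg.svd(A, compute_uv=False)
residual = np.linalg.norm(A - A_k, 'fro')

# Rank of the reconstruction
print(np.linalg.matrix_rank(A_k))
# Residual norm compared with the discarded singular values
print(np.isclose(residual, np.sqrt((s[k:] ** 2).sum())))
```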


In [13]:
# Question 3.1

def svd_projection(M, k):
    """Returns the matrix M reconstructed using only the k largest singular values"""
    U, s, Vt = la.svd(M, full_matrices=False)  # la.svd returns V transposed
    s[k:] = 0  # singular values come sorted in descending order
    M_ = U.dot(np.diag(s).dot(Vt))
    
    try:
        return pd.DataFrame(M_, index=M.index, columns=M.columns)
    except AttributeError:
        return M_

In [14]:
# Question 3.2

M = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1]])

Md = svd_projection(M, 2)
Md


Out[14]:
array([[ 0.16205797,  0.40049828,  0.37895454,  0.46756626,  0.17595367,
        -0.05265495, -0.11514284, -0.15910198, -0.09183827],
       [ 0.14058529,  0.36980077,  0.32899603,  0.40042722,  0.16497247,
        -0.03281545, -0.07056857, -0.09676827, -0.04298073],
       [ 0.15244948,  0.50500444,  0.35793658,  0.41010678,  0.23623173,
         0.02421652,  0.05978051,  0.0868573 ,  0.12396632],
       [ 0.25804933,  0.84112343,  0.60571995,  0.69735717,  0.39231795,
         0.03311801,  0.08324491,  0.12177239,  0.18737973],
       [ 0.44878975,  1.23436483,  1.0508615 ,  1.26579559,  0.55633139,
        -0.07378998, -0.15469383, -0.20959816, -0.04887954],
       [ 0.15955428,  0.5816819 ,  0.37521897,  0.41689768,  0.27654052,
         0.05590374,  0.1322185 ,  0.18891146,  0.21690761],
       [ 0.15955428,  0.5816819 ,  0.37521897,  0.41689768,  0.27654052,
         0.05590374,  0.1322185 ,  0.18891146,  0.21690761],
       [ 0.21846278,  0.54958058,  0.51096047,  0.62805802,  0.24253607,
        -0.06541098, -0.14252146, -0.19661186, -0.1079133 ],
       [ 0.09690639,  0.53206438,  0.22991365,  0.21175363,  0.26652513,
         0.13675618,  0.31462078,  0.44444058,  0.42496948],
       [-0.06125388,  0.23210821, -0.1388984 , -0.26564589,  0.14492549,
         0.24042105,  0.54614717,  0.7673742 ,  0.66370933],
       [-0.06467702,  0.33528115, -0.14564055, -0.30140607,  0.20275641,
         0.30572612,  0.69489337,  0.97661121,  0.84874969],
       [-0.04308204,  0.25390566, -0.09666695, -0.20785821,  0.1519134 ,
         0.22122703,  0.50294488,  0.70691163,  0.6155044 ]])

In [15]:
# Question 3.3

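# Results for full-rank matrix (not graded - just here for comparison)
# Note: tril_indices_from(..., 1) also selects the diagonal and the first
# superdiagonal, so the two within-group means below include self-correlations;
# k=-1 would select only the entries strictly below the diagonal.
rho, pval = st.spearmanr(M)
np.mean(rho[:5, :5][np.tril_indices_from(rho[:5, :5], 1)]), \
np.mean(rho[5:, 5:][np.tril_indices_from(rho[5:, 5:], 1)]), \
rho[5:, :5].mean()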


Out[15]:
(0.26427723987536589, 0.66269679816762439, -0.30756218890559084)

In [16]:
rho


Out[16]:
array([[ 1.        , -0.19245009,  0.        ,  0.07339758, -0.33333333,
        -0.17407766, -0.25819889, -0.33333333, -0.33333333],
       [-0.19245009,  1.        ,  0.        , -0.12712835,  0.57735027,
        -0.30151134, -0.4472136 , -0.57735027, -0.19245009],
       [ 0.        ,  0.        ,  1.        ,  0.43822991,  0.        ,
        -0.21320072, -0.31622777, -0.40824829, -0.40824829],
       [ 0.07339758, -0.12712835,  0.43822991,  1.        , -0.33028913,
        -0.17248787, -0.25584086, -0.33028913, -0.33028913],
       [-0.33333333,  0.57735027,  0.        , -0.33028913,  1.        ,
        -0.17407766, -0.25819889, -0.33333333, -0.33333333],
       [-0.17407766, -0.30151134, -0.21320072, -0.17248787, -0.17407766,
         1.        ,  0.67419986,  0.52223297, -0.17407766],
       [-0.25819889, -0.4472136 , -0.31622777, -0.25584086, -0.25819889,
         0.67419986,  1.        ,  0.77459667,  0.25819889],
       [-0.33333333, -0.57735027, -0.40824829, -0.33028913, -0.33333333,
         0.52223297,  0.77459667,  1.        ,  0.55555556],
       [-0.33333333, -0.19245009, -0.40824829, -0.33028913, -0.33333333,
        -0.17407766,  0.25819889,  0.55555556,  1.        ]])
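
For reference, a within-group average that leaves out the self-correlations can be taken over the strictly lower triangle of the group's block. The sketch below copies the $G2$ block (last 4 rows and columns) from the matrix printed above; passing `k=-1` to `np.tril_indices_from` selects only the entries below the diagonal.

```python
import numpy as np

# G2 block of the Spearman correlation matrix printed above.
block = np.array([
    [ 1.        ,  0.67419986,  0.52223297, -0.17407766],
    [ 0.67419986,  1.        ,  0.77459667,  0.25819889],
    [ 0.52223297,  0.77459667,  1.        ,  0.55555556],
    [-0.17407766,  0.25819889,  0.55555556,  1.        ],
])

rows, cols = np.tril_indices_from(block, k=-1)   # strictly below the diagonal
off_diag = block[rows, cols]

print(len(off_diag))      # number of distinct document pairs in the group
print(off_diag.mean())    # mean correlation without the diagonal
```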

In [17]:
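# Results after LSA (graded)
# G1/G1, G2/G2 and G1/G2 average correlation
# Note: as written, tril_indices_from(..., 1) keeps the diagonal and first
# superdiagonal, so the within-group means include self-correlations (k=-1 excludes them).

rho, pval = st.spearmanr(Md)
np.mean(rho[:5, :5][np.tril_indices_from(rho[:5, :5], 1)]), \
np.mean(rho[5:, 5:][np.tril_indices_from(rho[5:, 5:], 1)]), \
rho[5:, :5].mean()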


Out[17]:
(0.89879963065558643, 0.99354491662183997, -0.67775935811735)

In [18]:
rho


Out[18]:
array([[ 1.        ,  0.84561404,  1.        ,  1.        ,  0.71929825,
        -0.83712913, -0.83712913, -0.83712913, -0.80210281],
       [ 0.84561404,  1.        ,  0.84561404,  0.84561404,  0.97192982,
        -0.55691854, -0.55691854, -0.55691854, -0.47986063],
       [ 1.        ,  0.84561404,  1.        ,  1.        ,  0.71929825,
        -0.83712913, -0.83712913, -0.83712913, -0.80210281],
       [ 1.        ,  0.84561404,  1.        ,  1.        ,  0.71929825,
        -0.83712913, -0.83712913, -0.83712913, -0.80210281],
       [ 0.71929825,  0.97192982,  0.71929825,  0.71929825,  1.        ,
        -0.38879219, -0.38879219, -0.38879219, -0.29772375],
       [-0.83712913, -0.55691854, -0.83712913, -0.83712913, -0.38879219,
         1.        ,  1.        ,  1.        ,  0.97902098],
       [-0.83712913, -0.55691854, -0.83712913, -0.83712913, -0.38879219,
         1.        ,  1.        ,  1.        ,  0.97902098],
       [-0.83712913, -0.55691854, -0.83712913, -0.83712913, -0.38879219,
         1.        ,  1.        ,  1.        ,  0.97902098],
       [-0.80210281, -0.47986063, -0.80210281, -0.80210281, -0.29772375,
         0.97902098,  0.97902098,  0.97902098,  1.        ]])

Exercise 4 (40 points). Clustering with LSA

  1. Begin by loading a PubMed database of selected article titles using 'pickle', with the following:

    import pickle
    docs = pickle.load(open('pubmed.pic', 'rb'))

    Create a tf-idf matrix for every term that appears at least once in any of the documents. What is the shape of the tf-idf matrix?

  2. Perform SVD on the tf-idf matrix to obtain $U \Sigma V^T$ (often written as $T \Sigma D^T$ in this context with $T$ representing the terms and $D$ representing the documents). If we set all but the top $k$ singular values to 0, the reconstructed matrix is essentially $U_k \Sigma_k V_k^T$, where $U_k$ is $m \times k$, $\Sigma_k$ is $k \times k$ and $V_k^T$ is $k \times n$. Terms in this reduced space are represented by $U_k \Sigma_k$ and documents by $\Sigma_k V^T_k$. Reconstruct the matrix using the first $k=10$ singular values.

  3. Use agglomerative hierarchical clustering with complete linkage to plot a dendrogram and comment on the likely number of document clusters with $k = 100$. Use the dendrogram function from SciPy.

  4. Determine how similar each of the original documents is to the new document mystery.txt. Since $A = U \Sigma V^T$, we also have $V = A^T U \Sigma^{-1}$ using orthogonality and the rule for transposing matrix products. This suggests that in order to map the new document to the same concept space, we first find the tf-idf vector $v$ for the new document, which must contain all (and only) the terms present in the existing tf-idf matrix. Then the query vector $q$ is given by $v^T U_k \Sigma_k^{-1}$. Find the 10 documents most similar to the new document and the 10 most dissimilar.
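
The fold-in formula in item 4 can be sanity-checked on a small example: if the "new" document is actually column $j$ of $A$, then $v^T U_k \Sigma_k^{-1}$ should reproduce that document's row of $V_k$. This is a sketch with a random matrix standing in for the tf-idf matrix; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((8, 5))                 # hypothetical terms x documents matrix
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

j = 1
v = A[:, j]                            # treat document j as the "new" document
q = v @ U_k @ np.diag(1.0 / s_k)       # q = v^T U_k Sigma_k^{-1}

# q should match document j's coordinates in V_k
print(np.allclose(q, Vt_k[:, j]))
```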


In [25]:
# Question 4.1

import pickle

docs = pickle.load(open('pubmed.pic', 'rb'))
df = tf_idf(docs)
df.shape


Out[25]:
(6488, 178)

In [26]:
# Question 4.2

k = 10
T, s, D = la.svd(df)

print(T.shape, s.shape, D.shape, '\n')

df_10 = T[:,:k].dot(np.diag(s[:k]).dot(D[:k,:]))
assert(df.shape == df_10.shape)
df_10


(6488, 6488) (178,) (178, 178) 

Out[26]:
array([[ 0.04426728, -0.05128911,  0.17685338, ...,  0.02251461,
         0.12676269, -0.21022814],
       [ 0.00293911,  0.04269075,  0.00837922, ...,  0.0354431 ,
         0.02936161,  0.09236818],
       [ 0.00236582, -0.0758258 ,  0.0661755 , ..., -0.04051041,
        -0.00907675, -0.18027799],
       ..., 
       [ 0.00796169,  0.00864913,  0.01581232, ...,  0.00579137,
         0.02599572,  0.01220511],
       [ 0.03053506,  0.12239512,  0.07019758, ...,  0.09256041,
         0.12828975,  0.19820361],
       [ 0.00819589,  0.01629815,  0.01721203, ...,  0.01149927,
         0.03127815,  0.0262137 ]])

In [29]:
# Question 4.2 (alternative solution 1: setting unwanted singular values to zero)

T, s, D = la.svd(df, full_matrices=False)
print(T.shape, s.shape, D.shape, '\n')

s[10:] = 0
df_10 = T.dot(np.diag(s).dot(D))
assert(df.shape == df_10.shape)
df_10


(6488, 178) (178,) (178, 178) 

Out[29]:
array([[ 0.04426728, -0.05128911,  0.17685338, ...,  0.02251461,
         0.12676269, -0.21022814],
       [ 0.00293911,  0.04269075,  0.00837922, ...,  0.0354431 ,
         0.02936161,  0.09236818],
       [ 0.00236582, -0.0758258 ,  0.0661755 , ..., -0.04051041,
        -0.00907675, -0.18027799],
       ..., 
       [ 0.00796169,  0.00864913,  0.01581232, ...,  0.00579137,
         0.02599572,  0.01220511],
       [ 0.03053506,  0.12239512,  0.07019758, ...,  0.09256041,
         0.12828975,  0.19820361],
       [ 0.00819589,  0.01629815,  0.01721203, ...,  0.01149927,
         0.03127815,  0.0262137 ]])
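
The two solutions above differ only in the shape of the left factor: with the default `full_matrices=True` the SVD returns a square $m \times m$ matrix, while `full_matrices=False` keeps only the first $\min(m, n)$ columns. Either way the truncated product is the same. A small sketch with NumPy follows (scipy.linalg.svd takes the same `full_matrices` argument); the matrix `A` is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((7, 3))
k = 2

U_full, s_full, Vt_full = np.linalg.svd(A)                        # full_matrices=True by default
U_thin, s_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)

print(U_full.shape, U_thin.shape)      # (7, 7) (7, 3)

# Only the first k columns/rows enter the truncated reconstruction.
A_full = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]
A_thin = U_thin[:, :k] @ np.diag(s_thin[:k]) @ Vt_thin[:k, :]
print(np.allclose(A_full, A_thin))
```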

In [30]:
! pip install sparsesvd


Requirement already satisfied: sparsesvd in /Users/cliburn/anaconda2/envs/p3/lib/python3.5/site-packages
Requirement already satisfied: scipy>=0.6.0 in /Users/cliburn/anaconda2/envs/p3/lib/python3.5/site-packages (from sparsesvd)
Requirement already satisfied: cython in /Users/cliburn/anaconda2/envs/p3/lib/python3.5/site-packages (from sparsesvd)
Requirement already satisfied: numpy>=1.7.1 in /Users/cliburn/anaconda2/envs/p3/lib/python3.5/site-packages (from scipy>=0.6.0->sparsesvd)

In [32]:
# Question 4.2 (alternative solution 2 using sparsesvd)

from scipy.sparse import csc_matrix 
from sparsesvd import sparsesvd 

k = 10
T, s, D = sparsesvd(csc_matrix(df), k=k)

print(T.shape, s.shape, D.shape, '\n')
df_10 = T.T.dot(np.diag(s).dot(D))
assert(df.shape == df_10.shape)
print(df_10)


(10, 6488) (10,) (10, 178) 

[[ 0.04426728 -0.05128911  0.17685338 ...,  0.02251461  0.12676269
  -0.21022814]
 [ 0.00293911  0.04269075  0.00837922 ...,  0.0354431   0.02936161
   0.09236818]
 [ 0.00236582 -0.0758258   0.0661755  ..., -0.04051041 -0.00907675
  -0.18027799]
 ..., 
 [ 0.00796169  0.00864913  0.01581232 ...,  0.00579137  0.02599572
   0.01220511]
 [ 0.03053506  0.12239512  0.07019758 ...,  0.09256041  0.12828975
   0.19820361]
 [ 0.00819589  0.01629815  0.01721203 ...,  0.01149927  0.03127815
   0.0262137 ]]

In [33]:
# Question 4.3

from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

plt.figure(figsize=(16,36))
T, s, D = sparsesvd(csc_matrix(df), k=100)
x = np.diag(s).dot(D).T  # documents in the reduced space, one row per document
data_dist = pdist(x, metric='cosine')  # condensed pairwise cosine distances
data_link = linkage(data_dist, method='complete')  # complete linkage, as the exercise asks
labels = [c[:40] for c in df.columns]  # truncate titles so the labels fit
dendrogram(data_link, orientation='right', labels=labels);
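
One way to turn a dendrogram into a concrete clustering is to cut the tree with scipy.cluster.hierarchy.fcluster. The sketch below uses made-up, well-separated 2D points in place of the document coordinates `x`, and the choice of 3 clusters is an assumption for the toy data, not the answer for the PubMed collection.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
# Three hypothetical groups of points pointing in clearly different directions.
centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
points = np.vstack([c + 0.05 * rng.standard_normal((10, 2)) for c in centers])

dist = pdist(points, metric='cosine')                   # condensed cosine distance matrix
link = linkage(dist, method='complete')                 # complete linkage
labels = fcluster(link, t=3, criterion='maxclust')      # cut the tree into at most 3 clusters

print(np.unique(labels))
```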



In [35]:
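# Question 4.4

k = 100
T, s, D = sparsesvd(csc_matrix(df), k=k)

doc = {'mystery': open('mystery.txt').read()}
# tf_idf on a single document gives every term the same idf, log(1/2) < 0,
# so q is a negative multiple of the projected term counts.
terms = tf_idf(doc)
# Align the query with the corpus vocabulary; terms absent from the query become 0
query_terms = df.join(terms).fillna(0)['mystery']
q = query_terms.T.dot(T.T.dot(np.diag(1.0/s)))

# x holds the k=100 document coordinates from Question 4.3. Because q is negated,
# the largest cosine distances mark the most similar documents, hence the reversal.
ranked_docs = df.columns[np.argsort(cosine_dist(q, x))][::-1]
print("Query article:")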
print(' '.join(line.strip() for line in doc['mystery'].splitlines()[:2]))
print()
print("Most similar")
print('='*80)
for i, title in enumerate(ranked_docs[:10]):
    print('%03d' % i, title)

print()
print("Most dissimilar")
print('='*80)
for i, title in enumerate(ranked_docs[-10:]):
    print('%03d' % (len(docs) - i), title)


Query article:
Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes

Most similar
================================================================================
000 Diabetes Numeracy and Blood Glucose Control: Association With Type of Diabetes and Source of Care.
001 Feasibility of the SMART Project: A Text Message Program for Adolescents With Type 1 Diabetes.
002 Health Care Utilization Among U.S. Adults With Diagnosed Diabetes, 2013.
003 Demographic Disparities Among Medicare Beneficiaries with Type 2 Diabetes Mellitus in 2011: Diabetes Prevalence, Comorbidities, and Hypoglycemia Events.
004 Disparities in Postpartum Follow-Up in Women With Gestational Diabetes Mellitus.
005 Prevalence and Determinants of Anemia in Older People With Diabetes Attending an Outpatient Clinic: A Cross-Sectional Audit.
006 Outcomes of a Diabetes Education Program for Registered Nurses Caring for Individuals With Diabetes.
007 Gestational Diabetes Mellitus Screening Using the One-Step Versus Two-Step Method in a High-Risk Practice.
008 Evaluating the toxic and beneficial effects of lichen extracts in normal and diabetic rats.
009 Efficacy and Safety of Saxagliptin as Add-On Therapy in Type 2 Diabetes.

Most dissimilar
================================================================================
178 Phenotypic profiling of CD8 + T cells during Plasmodium vivax blood-stage infection.
177 ERK1/2 promoted proliferation and inhibited apoptosis of human cervical cancer cells and regulated the expression of c-Fos and c-Jun proteins.
176 Avian haemosporidians from Neotropical highlands: Evidence from morphological and molecular data.
175 Nerve Growth Factor Potentiates Nicotinic Synaptic Transmission in Mouse Airway Parasympathetic Neurons.
174 Dopamine Increases CD14+CD16+ Monocyte Migration and Adhesion in the Context of Substance Abuse and HIV Neuropathogenesis.
173 Crystal Structures of the Carboxyl cGMP Binding Domain of the Plasmodium falciparum cGMP-dependent Protein Kinase Reveal a Novel Capping Triad Crucial for Merozoite Egress.
172 Antibodies to the Plasmodium falciparum proteins MSPDBL1 and MSPDBL2 opsonise merozoites, inhibit parasite growth and predict protection from clinical malaria.
171 CD4 T-cell subsets in malaria: TH1/TH2 revisited.
170 CD40 Is Required for Protective Immunity against Liver Stage Plasmodium Infection.
169 IRGM3 contributes to immunopathology and is required for differentiation of antigen-specific effector CD8+ T cells in experimental cerebral malaria.

Notes on the Pubmed articles

These were downloaded with the following script.

import pickle

from Bio import Entrez, Medline

Entrez.email = "YOUR EMAIL HERE"

try:
    # Reuse the cached download if it exists
    with open('pubmed.pic', 'rb') as f:
        docs = pickle.load(f)
except Exception as e:
    print(e)

    # Otherwise fetch up to 50 records per search term from PubMed
    docs = {}
    for term in ['plasmodium', 'diabetes', 'asthma', 'cytometry']:
        handle = Entrez.esearch(db="pubmed", term=term, retmax=50)
        result = Entrez.read(handle)
        handle.close()
        idlist = result["IdList"]
        handle2 = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
        result2 = Medline.parse(handle2)
        for record in result2:
            title = record.get("TI", None)
            abstract = record.get("AB", None)
            # Skip records that lack a title or an abstract
            if title is None or abstract is None:
                continue
            docs[title] = '\n'.join([title, abstract])
            print(title)
        handle2.close()
    with open('pubmed.pic', 'wb') as f:
        pickle.dump(docs, f)
docs.values()
