In [3]:
from tethne.readers import wos
datapath = '/Users/erickpeirson/Downloads/datasets/wos/genecol* OR common garden 1-500.txt'
corpus = wos.read(datapath)

Networks of features based on co-occurrence

The features module in the tethne.networks subpackage contains a few functions for generating networks of features based on co-occurrence.


In [4]:
from tethne.networks import features

We can use index_feature() to tokenize each abstract into individual words.


In [19]:
corpus.index_feature('abstract', tokenize=lambda x: x.split(' '))
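
Note that splitting on spaces leaves punctuation attached to words ('arthropods.' and 'arthropods' would be indexed as separate features). If a more careful tokenizer is wanted, NLTK's word_tokenize (we use NLTK again below for the stoplist) is one option. A sketch, not used in the rest of this walkthrough:

from nltk.tokenize import word_tokenize
corpus.index_feature('abstract', tokenize=word_tokenize)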

Here are all of the papers whose abstracts contain the word 'arthropod':


In [20]:
abstractTerms = corpus.features['abstract']

In [63]:
abstractTerms.papers_containing('arthropod')


Out[63]:
[u'WOS:000324532900018',
 u'WOS:000295132100020',
 u'WOS:000324408400012',
 u'WOS:000305342300080',
 u'WOS:000296938400006',
 u'WOS:000325555300008',
 u'WOS:000299058100016',
 u'WOS:000321823300005',
 u'WOS:000323699000010',
 u'WOS:000304300100004']

The transform method allows us to transform the values from one featureset using a custom function. One popular transformation for wordcount data is the term frequency * inverse document frequency (tf*idf) transformation. tf*idf down-weights words that occur in many documents across the corpus, and is supposed to bring to the foreground the words that are most "important" for each document.
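
Concretely, in the implementation below the weight of a word with count c in a document is c * log(N/DC), where N is the number of documents in the featureset and DC is the number of documents in which the word occurs.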


In [28]:
from math import log

def tfidf(f, c, C, DC):
    """
    Apply the term frequency * inverse document frequency transformation.
    """
    tf = float(c)    # Frequency of the word in this document.
    # len(abstractTerms.features) is the number of documents in the
    # featureset; DC is the number of documents containing this word.
    idf = log(float(len(abstractTerms.features))/float(DC))
    return tf*idf

In [29]:
corpus.features['abstracts_tfidf'] = abstractTerms.transform(tfidf)

We can specify some other transformation by first defining a transformer function, and then passing it as an argument to transform. A transformer function should accept the following parameters, and return a single numerical value (int or float).

Parameter   Description
f           Representation of the feature (e.g. string).
v           Value of the feature in the document (e.g. frequency).
C           Value of the feature in the Corpus (e.g. global frequency).
DC          Number of documents in which the feature occurs.

For example:


In [21]:
def mytransformer(f, v, C, DC):
    """
    Doubles the feature value and divides by the overall value in the Corpus.
    """
    return v*2./C    # Twice the document value, normalized by the global value.

We can then pass the transformer function to transform as the first positional argument.


In [22]:
corpus.features['abstracts_transformed'] = abstractTerms.transform(mytransformer)

Here is the impact on the value for 'arthropod' in one document, using the two transformations above.


In [62]:
print 'Before: '.ljust(15), corpus.features['abstract'].features['WOS:000324532900018'].value('arthropod')
print 'TF*IDF: '.ljust(15), corpus.features['abstracts_tfidf'].features['WOS:000324532900018'].value('arthropod')
print 'mytransformer: '.ljust(15), corpus.features['abstracts_transformed'].features['WOS:000324532900018'].value('arthropod')


Before:         4
TF*IDF:         15.6240197324
mytransformer:  0.307692307692

We can also use transform() to remove words from our FeatureSet: a feature for which the transformer returns 0 is excluded from the new FeatureSet. For example, we can apply the NLTK stoplist and remove too-common or too-rare words:


In [67]:
from nltk.corpus import stopwords
stoplist = set(stopwords.words())    # Use a set for fast membership tests.

def apply_stoplist(f, v, c, dc):
    # Drop stopwords, words appearing in more than 50 documents,
    # and words appearing in fewer than 3 documents.
    if f in stoplist or dc > 50 or dc < 3:
        return 0
    return v

In [68]:
corpus.features['abstracts_filtered'] = corpus.features['abstracts_tfidf'].transform(apply_stoplist)

In [72]:
print 'Before: '.ljust(10), len(corpus.features['abstracts_tfidf'].index)
print 'After: '.ljust(10), len(corpus.features['abstracts_filtered'].index)


Before:    20954
After:     4557

The mutual_information function in the features module generates a network based on the pointwise mutual information of each pair of features in a featureset.
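
For intuition, the PMI of two words compares how often they co-occur with how often we would expect them to co-occur if they were independent. A minimal sketch of the computation from document-level probabilities (tethne's own implementation details, e.g. normalization, may differ):

from math import log

def pmi(p_xy, p_x, p_y):
    """
    Pointwise mutual information: log of the ratio between the observed
    joint probability and the probability expected under independence.
    """
    return log(p_xy / (p_x * p_y))

# Two words that each occur in 10% of documents, and co-occur in 5%:
print pmi(0.05, 0.1, 0.1)    # log(5.0) ~= 1.61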

The first argument is the Corpus, just like most other network-building functions. The second argument is the name of the featureset that we wish to use.


In [78]:
MI_graph = features.mutual_information(corpus, 'abstracts_filtered', min_weight=0.7)

Take a look at the ratio of nodes to edges to get a sense of how to tune the min_weight parameter. If you have an extremely high number of edges for the number of nodes, then you should probably increase min_weight to obtain a more legible network. Depending on your field, you may have some guidance from theory as well.
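
One way to get a feel for the threshold is to rebuild the graph at a few candidate values and compare the node and edge counts (a quick sketch; rebuilding repeatedly may be slow for a large corpus):

for mw in [0.5, 0.6, 0.7, 0.8, 0.9]:
    g = features.mutual_information(corpus, 'abstracts_filtered', min_weight=mw)
    print mw, g.order(), g.size()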


In [79]:
print 'This graph has {0} nodes and {1} edges'.format(MI_graph.order(), MI_graph.size())


This graph has 2271 nodes and 5595 edges

Once again, we'll use the GraphML writer to generate a visualizable network file.


In [80]:
from tethne.writers import graph

In [81]:
mi_outpath = '/Users/erickpeirson/Projects/tethne-notebooks/output/mi_graph.graphml'

In [82]:
graph.to_graphml(MI_graph, mi_outpath)

