In [3]:
from tethne.readers import wos
datapath = '/Users/erickpeirson/Downloads/datasets/wos/genecol* OR common garden 1-500.txt'
corpus = wos.read(datapath)
In [4]:
from tethne.networks import features
We can use index_feature() to tokenize the abstract into individual words.
In [19]:
corpus.index_feature('abstract', tokenize=lambda x: x.split(' '))
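Splitting on spaces is quick, but it leaves punctuation attached to words and treats case variants as distinct tokens. Here is a sketch (not part of Tethne) of a slightly more robust tokenizer, using only the standard library:

```python
import re

def tokenize(text):
    """Lowercase the text, and extract runs of letters as tokens."""
    return re.findall(r'[a-z]+', text.lower())

# Punctuation and capitalization no longer produce distinct tokens.
tokenize('Arthropod herbivory, in common gardens.')
```

Any such function can be passed as the ``tokenize`` argument to ``index_feature`` in the same way as the lambda above.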
Here are all of the papers whose abstracts contain the word 'arthropod':
In [20]:
abstractTerms = corpus.features['abstract']
In [63]:
abstractTerms.papers_containing('arthropod')
Out[63]:
The transform method allows us to transform the values in one featureset using a custom function. One popular transformation for wordcount data is the term frequency * inverse document frequency (tf*idf) transformation. tf*idf weights the wordcounts for each document according to how rare each word is in the rest of the corpus, bringing to the foreground the words that are most "important" for each document.
In [28]:
from math import log

def tfidf(f, c, C, DC):
    """
    Apply the term frequency * inverse document frequency transformation.
    """
    tf = float(c)
    idf = log(float(len(abstractTerms.features))/float(DC))
    return tf*idf
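To make the arithmetic concrete, here is a minimal standalone check of the formula above, using toy counts rather than the Corpus (``N`` documents in total, a word occurring ``c`` times in one document and appearing in ``DC`` documents overall):

```python
from math import log

def tfidf_value(c, N, DC):
    """tf*idf: term frequency, weighted by the log inverse document frequency."""
    return float(c) * log(float(N) / float(DC))

# A word counted 3 times in a document but found in only 10 of 500 documents
# scores higher than a word counted 3 times but found in 400 documents.
rare = tfidf_value(3, 500, 10)
common = tfidf_value(3, 500, 400)
```

Words that occur in every document get ``log(1) = 0``, and so are weighted down to nothing regardless of their local frequency.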
In [29]:
corpus.features['abstracts_tfidf'] = abstractTerms.transform(tfidf)
We can apply some other transformation by first defining a transformer function, and then passing it as an argument to transform. A transformer function should accept the following parameters, and return a single numerical value (int or float).
| Parameter | Description |
|---|---|
| `f` | Representation of the feature (e.g. string). |
| `v` | Value of the feature in the document (e.g. frequency). |
| `C` | Value of the feature in the Corpus (e.g. global frequency). |
| `DC` | Number of documents in which the feature occurs. |
For example:
In [21]:
def mytransformer(f, v, C, DC):
    """
    Doubles the feature value, and divides by the overall value in the Corpus.
    """
    return v*2./C
We can then pass the transformer function to transform as the first positional argument.
In [22]:
corpus.features['abstracts_transformed'] = abstractTerms.transform(mytransformer)
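As a sanity check on the arithmetic, the transformer's behavior can be verified standalone, outside of Tethne (hypothetical toy values):

```python
def mytransformer(f, v, C, DC):
    """Double the document-level value, and divide by the Corpus-wide value."""
    return v*2./C

# A word counted 4 times in one document and 16 times corpus-wide.
mytransformer('arthropod', 4, 16, 2)
```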
Here is the impact on the value for 'arthropod' in one document, using the two transformations above.
In [62]:
print 'Before: '.ljust(15), corpus.features['abstract'].features['WOS:000324532900018'].value('arthropod')
print 'TF*IDF: '.ljust(15), corpus.features['abstracts_tfidf'].features['WOS:000324532900018'].value('arthropod')
print 'mytransformer: '.ljust(15), corpus.features['abstracts_transformed'].features['WOS:000324532900018'].value('arthropod')
We can also use transform() to remove words from our FeatureSet. For example, we can apply the NLTK stoplist and remove too-common or too-rare words:
In [67]:
from nltk.corpus import stopwords
stoplist = stopwords.words()
def apply_stoplist(f, v, c, dc):
    """
    Remove stopwords, and words occurring in too many (> 50) or too few (< 3) documents.
    """
    if f in stoplist or dc > 50 or dc < 3:
        return 0    # Returning 0 removes the feature from the FeatureSet.
    return v
In [68]:
corpus.features['abstracts_filtered'] = corpus.features['abstracts_tfidf'].transform(apply_stoplist)
In [72]:
print 'Before: '.ljust(10), len(corpus.features['abstracts_tfidf'].index)
print 'After: '.ljust(10), len(corpus.features['abstracts_filtered'].index)
The mutual_information function in the features module generates a network based on the pointwise mutual information of each pair of features in a featureset.
The first argument is the Corpus, just like most other network-building functions. The second argument is the name of the featureset that we wish to use.
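Pointwise mutual information compares how often two words actually co-occur with how often they would be expected to co-occur if they were independent: pmi(x, y) = log(p(x, y) / (p(x) p(y))). Here is a minimal standalone sketch of that calculation over a toy set of documents (an illustration of the idea, not Tethne's implementation; Tethne may also normalize the value, so check its documentation when choosing min_weight):

```python
from math import log

def pmi(x, y, documents):
    """Pointwise mutual information of words x and y, estimated from
    document-level occurrence probabilities."""
    N = float(len(documents))
    p_x = sum(1 for d in documents if x in d) / N
    p_y = sum(1 for d in documents if y in d) / N
    p_xy = sum(1 for d in documents if x in d and y in d) / N
    return log(p_xy / (p_x * p_y))

docs = [{'arthropod', 'herbivory'},
        {'arthropod', 'herbivory', 'garden'},
        {'garden', 'soil'},
        {'soil', 'nitrogen'}]

# 'arthropod' and 'herbivory' always co-occur, so their PMI is positive;
# words that co-occur no more than chance predicts score zero.
```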
In [78]:
MI_graph = features.mutual_information(corpus, 'abstracts_filtered', min_weight=0.7)
Take a look at the ratio of nodes to edges to get a sense of how to tune the min_weight parameter. If you have an extremely high number of edges for the number of nodes, then you should probably increase min_weight to obtain a more legible network. Depending on your field, you may have some guidance from theory as well.
In [79]:
print 'This graph has {0} nodes and {1} edges'.format(MI_graph.order(), MI_graph.size())
Once again, we'll use the GraphML writer to generate a visualizable network file.
In [80]:
from tethne.writers import graph
In [81]:
mi_outpath = '/Users/erickpeirson/Projects/tethne-notebooks/output/mi_graph.graphml'
In [82]:
graph.to_graphml(MI_graph, mi_outpath)
In [ ]: