In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [15]:
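# Corpus reader for Web of Science exports
from tethne.readers import wos
import pandas as pd
import networkx as nx
import numpy as np
# Tokenization and word-embedding libraries
import nltk
import gensim
# Local helpers for token filtering and normalization
from helpers import filter_token, normalize_token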

In [3]:
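# Stream the Web of Science records, indexing the date and abstract fields
# and the authors feature.
metadata = wos.read('../data/Baldwin/PlantPhysiology',
                    streaming=True, index_fields=['date', 'abstract'], index_features=['authors'])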

In [4]:
# Map each WoS ID to its abstract text; the '\r' print is a one-line progress display.
abstracts = {}
for abstract, wosids in metadata.indices['abstract'].iteritems():
    print '\r', wosids[0],
    abstracts[wosids[0]] = abstract


WOS:000185974800027

In [10]:
# Sanity check: split one abstract into sentences.
nltk.tokenize.sent_tokenize(abstracts.items()[0][1])


Out[10]:
[u'Most of the iron in legume seeds is stored in ferritin located in the amyloplast, which is used during seed germination.',
 u'However, there is a lack of information on the regulation of iron by phytoferritin.',
 u'In this study, soluble and insoluble forms of pea (Pisum sativum) seed ferritin (PSF) isolated from dried seeds were found to be identical 24-mer ferritins comprising H-1 and H-2 subunits.',
 u'The insoluble form is favored at low pH, whereas the two forms reversibly interconvert in the pH range of 6.0 to 7.8, with an apparent pK(a) of 6.7.',
 u'This phenomenon was not observed in animal ferritins, indicating that PSF is unique.',
 u'The pH of the amyloplast was found to be approximately 6.0, thus facilitating PSF association, which is consistent with the role of PSF in long-term iron storage.',
 u'Similar to previous studies, the results of this work showed that protein degradation occurs in purified PSF during storage, thus proving that phytoferritin also undergoes degradation during seedling germination.',
 u'In contrast, no degradation was observed in animal ferritins, suggesting that this degradation of phytoferritin may be due to the extension peptide (EP), a specific domain found only in phytoferritin.',
 u'Indeed, removal of EP from PSF significantly increased protein stability and prevented degradation under identical conditions while promoting protein dissociation.',
 u'Correlated with such dissociation was a considerable increase in the rate of ascorbate-induced iron release from PSF at pH 6.0.',
 u'Thus, phytoferritin may have facilitated the evolution of EP to enable it to regulate iron for storage or complement in seeds.']

In [19]:
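# Treat every sentence as one training document: tokenize it, drop the
# tokens rejected by filter_token, and normalize the rest.
documents = []
for abstract in abstracts.itervalues():
    for sentence in nltk.tokenize.sent_tokenize(abstract):
        documents.append([normalize_token(token)
                          for token in nltk.tokenize.word_tokenize(sentence)
                          if filter_token(token)])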

In [20]:
documents[0]


Out[20]:
[u'iron',
 u'legume',
 u'seed',
 u'stored',
 u'ferritin',
 u'located',
 u'amyloplast',
 u'used',
 u'seed',
 u'germination']
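
The `helpers` module that provides `filter_token` and `normalize_token` is not shown in this notebook. Judging only from Out[20], the filter appears to drop punctuation and common function words, and the normalizer appears to lowercase tokens and reduce plurals (`seeds` becomes `seed`). The block below is a minimal stand-in under those assumptions, not the actual helpers: the `_sketch` function names, the stopword list, and the plural rule are all hypothetical, and a regular expression replaces `nltk.tokenize.word_tokenize` so that the block runs with the standard library alone.

```python
import re

# Hypothetical stopword list; the real helpers may use a much larger one.
STOPWORDS = {"most", "of", "the", "in", "is", "which", "during",
             "a", "an", "and", "to", "was", "were", "that", "this"}


def filter_token_sketch(token):
    """Keep alphabetic tokens longer than two characters that are not stopwords."""
    lowered = token.lower()
    return lowered.isalpha() and len(lowered) > 2 and lowered not in STOPWORDS


def normalize_token_sketch(token):
    """Lowercase the token and strip a plain plural 's'.

    This is a naive stand-in for a real lemmatizer and will mangle
    words such as 'analysis'.
    """
    lowered = token.lower()
    if len(lowered) > 3 and lowered.endswith("s") and not lowered.endswith("ss"):
        return lowered[:-1]
    return lowered


def preprocess(sentence):
    """Tokenize on word characters and punctuation, then filter and normalize."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return [normalize_token_sketch(t) for t in tokens if filter_token_sketch(t)]


sentence = ("Most of the iron in legume seeds is stored in ferritin located "
            "in the amyloplast, which is used during seed germination.")
tokens = preprocess(sentence)
```

On this one sentence the sketch yields the same ten tokens as Out[20]. That agreement does not show the real helpers work this way.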

In [21]:
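# Train 200-dimensional word vectors with a 5-token context window (gensim 4.0+ renames `size` to `vector_size`).
model = gensim.models.Word2Vec(documents, size=200, window=5)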

In [25]:
# Words closest to the combined 'transport' + 'water' vector, ranked by cosine similarity.
model.most_similar(positive=['transport', 'water'])


Out[25]:
[(u'flow', 0.7300856113433838),
 (u'hydraulic', 0.7193525433540344),
 (u'net', 0.6972233057022095),
 (u'uptake', 0.6961174011230469),
 (u'blockage', 0.692240297794342),
 (u'mpa', 0.6783999800682068),
 (u'transpirational', 0.6728108525276184),
 (u'diffusion', 0.6726493835449219),
 (u'efflux', 0.6726212501525879),
 (u'evaporative', 0.6641442775726318)]
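
The scores in Out[25] are cosine similarities. For a query with several positive words, the ranking can be understood as: scale each query word's vector to unit length, average them, and compare the result against every other word's unit vector. The block below is a rough pure-Python sketch of that idea, not gensim's implementation. The function `most_similar_sketch` and the two-dimensional `toy_vectors` are made up for illustration and have nothing to do with the trained model's 200-dimensional vectors.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_sketch(vectors, positive, topn=10):
    """Rank words by cosine similarity to the mean of the positive words."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    # Average the unit vectors of the query words, then renormalize.
    columns = zip(*(units[word] for word in positive))
    query = unit([sum(col) / float(len(positive)) for col in columns])
    scores = [
        (word, sum(q * x for q, x in zip(query, vec)))
        for word, vec in units.items()
        if word not in positive  # the query words themselves are excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-D vectors, chosen only to make the geometry easy to follow.
toy_vectors = {
    "transport": [1.0, 0.0],
    "water": [0.0, 1.0],
    "flow": [1.0, 1.0],
    "uptake": [1.0, 0.5],
    "gene": [-1.0, 0.0],
}
ranking = most_similar_sketch(toy_vectors, positive=["transport", "water"])
```

Here `flow` ranks first because it points exactly between the two query words, `uptake` follows, and `gene` ranks last because it points away from the query.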
