In [15]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora
import urllib.request
from pprint import pprint
from nltk.corpus import stopwords  #requires the NLTK stopwords data: nltk.download('stopwords')
from collections import defaultdict

First, start with a collection of 5 abstracts, manually scraped from arXiv. Remove the stop words and the words that appear only once, then construct a dictionary; it ends up with 75 unique words. The output below shows the words in the dictionary and their corresponding integer IDs.


In [ ]:
doc1="Electron acceleration in a post-flare decimetric continuum source Prasad Subramanian, S. M. White, M. Karlický, R. Sych, H. S. Sawant, S. Ananthakrishnan(Submitted on 23 Mar 2007)Aims: To calculate the power budget for electron acceleration and the efficiency of the plasma emission mechanism in a post-flare decimetric continuum source. Methods: We have imaged a high brightness temperature (∼109K) post-flare source at 1060 MHz with the Giant Metrewave Radio Telescope (GMRT). We use information from these images and the dynamic spectrum from the Hiraiso spectrograph together with the theoretical method described in Subramanian & Becker (2006) to calculate the power input to the electron acceleration process. The method assumes that the electrons are accelerated via a second-order Fermi acceleration mechanism. Results: We find that the power input to the nonthermal electrons is in the range 3×1025--1026 erg/s. The efficiency of the overall plasma emission process starting from electron acceleration and culminating in the observed emission could range from 2.87×10−9 to 2.38×10−8."

doc2="Local (shearing box) simulations of the nonlinear evolution of the magnetorotational instability in a collisionless plasma show that angular momentum transport by pressure anisotropy (p⊥≠p∥, where the directions are defined with respect to the local magnetic field) is comparable to that due to the Maxwell and Reynolds stresses. Pressure anisotropy, which is effectively a large-scale viscosity, arises because of adiabatic invariants related to p⊥ and p∥ in a fluctuating magnetic field. In a collisionless plasma, the magnitude of the pressure anisotropy, and thus the viscosity, is determined by kinetic instabilities at the cyclotron frequency. Our simulations show that ∼50 % of the gravitational potential energy is directly converted into heat at large scales by the viscous stress (the remaining energy is lost to grid-scale numerical dissipation of kinetic and magnetic energy). We show that electrons receive a significant fraction (∼[Te/Ti]1/2) of this dissipated energy. Employing this heating by an anisotropic viscous stress in one dimensional models of radiatively inefficient accretion flows, we find that the radiative efficiency of the flow is greater than 0.5% for M˙≳10−4M˙Edd. Thus a low accretion rate, rather than just a low radiative efficiency, is necessary to explain the low luminosity of many accreting black holes. For Sgr A* in the Galactic Center, our predicted radiative efficiencies imply an accretion rate of ≈3×10−8M⊙yr−1 and an electron temperature of ≈3×1010 K at ≈10 Schwarzschild radii; the latter is consistent with the brightness temperature inferred from VLBI observations."

doc3="We review the theory of electron-conduction opacity, a fundamental ingredient in the computation of low-mass stellar models; shortcomings and limitations of the existing calculations used in stellar evolution are discussed. We then present new determinations of the electron-conduction opacity in stellar conditions for an arbitrary chemical composition, that improve over previous works and, most importantly, cover the whole parameter space relevant to stellar evolution models (i.e., both the regime of partial and high electron degeneracy). A detailed comparison with the currently used tabulations is also performed. The impact of our new opacities on the evolution of low-mass stars is assessed by computing stellar models along both the H- and He-burning evolutionary phases, as well as Main Sequence models of very low-mass stars and white dwarf cooling tracks."

doc4="The best measurement of the cosmic ray positron flux available today was performed by the HEAT balloon experiment more than 10 years ago. Given the limitations in weight and power consumption for balloon experiments, a novel approach was needed to design a detector which could increase the existing data by more than a factor of 100. Using silicon photomultipliers for the readout of a scintillating fiber tracker and of an imaging electromagnetic calorimeter, the PEBS detector features a large geometrical acceptance of 2500 cm^2 sr for positrons, a total weight of 1500 kg and a power consumption of 600 W. The experiment is intended to measure cosmic ray particle spectra for a period of up to 20 days at an altitude of 40 km circulating the North or South Pole. A full Geant 4 simulation of the detector concept has been developed and key elements have been verified in a testbeam in October 2006 at CERN."

doc5="The fluorescence detection of ultra high energy (> 10^18 eV) cosmic rays requires a detailed knowledge of the fluorescence light emission from nitrogen molecules, which are excited by the cosmic ray shower particles along their path in the atmosphere. We have made a precise measurement of the fluorescence light spectrum excited by MeV electrons in dry air. We measured the relative intensities of 34 fluorescence bands in the wavelength range from 284 to 429 nm with a high resolution spectrograph. The pressure dependence of the fluorescence spectrum was also measured from a few hPa up to atmospheric pressure. Relative intensities and collisional quenching reference pressures for bands due to transitions from a common upper level were found in agreement with theoretical expectations. The presence of argon in air was found to have a negligible effect on the fluorescence yield. We estimated that the systematic uncertainty on the cosmic ray shower energy due to the pressure dependence of the fluorescence spectrum is reduced to a level of 1% by the AIRFLY results presented in this paper."

documents=[doc1,doc2,doc3,doc4,doc5]

# tokenize and remove English stop words (build the stop word set once instead of inside the loop)
stoplist = set(stopwords.words('english'))
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
#keep only the tokens that appear more than once
texts = [[token for token in text if frequency[token] > 1] for text in texts]
#map every remaining word to a unique integer ID in a gensim Dictionary
dictionary = corpora.Dictionary(texts)
dictionary.save('firstdic.dict')  # store the dictionary, for future reference
#print(dictionary)
#look at the unique integer IDs for the 75 words
print(dictionary.token2id)
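
The saved dictionary can be reloaded later without rebuilding it from the raw abstracts (a quick check, assuming 'firstdic.dict' was written by the cell above):


In [ ]:
#reload the stored dictionary and confirm it matches the one in memory
loaded_dictionary = corpora.Dictionary.load('firstdic.dict')
print(loaded_dictionary)
print(loaded_dictionary.token2id == dictionary.token2id)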

Now test our small dictionary on a new abstract. It returns a sparse vector of (word ID, frequency) pairs; words that are not in the dictionary are simply ignored.


In [ ]:
#test a new document
doc6="This paper presents the effects of electron-positron pair production on the linear growth of the resistive hose instability of a filamentary beam that could lead to snake-like distortion. For both the rectangular radial density profile and the diffuse profile reflecting the Bennett-type equilibrium for a self-collimating flow, the modified eigenvalue equations are derived from a Vlasov-Maxwell equation. While for the simple rectangular profile, current perturbation is localized at the sharp radial edge, for the realistic Bennett profile with an obscure edge, it is non-locally distributed over the entire beam, removing catastrophic wave-particle resonance. The pair production effects likely decrease the betatron frequency, and expand the beam radius to increase the resistive decay time of the perturbed current; these also lead to a reduction of the growth rate. It is shown that, for the Bennett profile case, the characteristic growth distance for a preferential mode can exceed the observational length-scale of astrophysical jets. This might provide the key to the problem of the stabilized transport of the astrophysical jets including extragalactic jets up to Mpc (∼3×1024 cm) scales."
new_vec = dictionary.doc2bow(doc6.lower().split())
print(new_vec)
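
To make the vector easier to read, the integer IDs can be mapped back to the tokens they stand for (a small sketch using the dictionary built above):


In [ ]:
#translate each (word ID, count) pair back into a (token, count) pair
print([(dictionary[word_id], count) for word_id, count in new_vec])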

Now use the arXiv API instead of manual scraping; try it on one abstract to see how it works:


In [14]:
#grab the <summary> field from the Atom feed returned by one arXiv API query
url = 'http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=1'
data = urllib.request.urlopen(url).read()
#convert the bytes to a string and split out the text between the <summary> tags
datasummary = str(data, 'utf-8').split('<summary>', 1)[1].split('</summary>', 1)[0]
print(datasummary)


  The effect of the electron-electron cusp on the convergence of configuration
interaction (CI) wave functions is examined. By analogy with the
pseudopotential approach for electron-ion interactions, an effective
electron-electron interaction is developed which closely reproduces the
scattering of the Coulomb interaction but is smooth and finite at zero
electron-electron separation. The exact many-electron wave function for this
smooth effective interaction has no cusp at zero electron-electron separation.
We perform CI and quantum Monte Carlo calculations for He and Be atoms, both
with the Coulomb electron-electron interaction and with the smooth effective
electron-electron interaction. We find that convergence of the CI expansion of
the wave function for the smooth electron-electron interaction is not
significantly improved compared with that for the divergent Coulomb interaction
for energy differences on the order of 1 mHartree. This shows that, contrary to
popular belief, description of the electron-electron cusp is not a limiting
factor, to within chemical accuracy, for CI calculations.
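
Splitting the raw response on the <summary> string works here, but it is fragile. The arXiv API actually returns an Atom (XML) feed, so the standard-library XML parser can extract the summary more robustly; a small sketch:


In [ ]:
import xml.etree.ElementTree as ET

url = 'http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=1'
data = urllib.request.urlopen(url).read()
root = ET.fromstring(data)
#the feed uses the Atom namespace, so element lookups need it spelled out
ns = {'atom': 'http://www.w3.org/2005/Atom'}
for entry in root.findall('atom:entry', ns):
    print(entry.find('atom:summary', ns).text)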

Now try our small dictionary on the first 10 abstracts returned by the query:


In [40]:
for articleN in range(0,10):
    url = 'http://export.arxiv.org/api/query?search_query=all:electron&start='+str(articleN)+'&max_results=1'
    print(url)
    data=urllib.request.urlopen(url).read()
    datasummary=str(data,'utf-8').split("<summary>",1)[1].split('</summary',1)[0]
    vec=dictionary.doc2bow(datasummary.lower().split())
    print(vec)


http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=1
[(0, 1), (29, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=1&max_results=1
[(2, 1), (3, 4), (5, 3), (7, 1), (36, 2)]
http://export.arxiv.org/api/query?search_query=all:electron&start=2&max_results=1
[(3, 2), (5, 2), (29, 1), (34, 1), (44, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=3&max_results=1
[(1, 5), (3, 7), (12, 1), (14, 1), (15, 1), (19, 1), (29, 5), (33, 1), (34, 1), (39, 1), (65, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=4&max_results=1
[(5, 2), (6, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=5&max_results=1
[(0, 1), (5, 2), (27, 1), (35, 3), (41, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=6&max_results=1
[(3, 3), (5, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=7&max_results=1
[(6, 4), (12, 1), (27, 1), (33, 1), (43, 1), (70, 1), (74, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=8&max_results=1
[(3, 6), (5, 4), (51, 1), (53, 1), (65, 1), (67, 1)]
http://export.arxiv.org/api/query?search_query=all:electron&start=9&max_results=1
[(3, 1), (5, 2), (19, 1), (36, 2), (67, 1)]
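
Making one HTTP request per abstract is wasteful; the same 10 summaries can be pulled down in a single request by setting max_results=10 and splitting out each <summary> block (a sketch, still using the same string-splitting approach as above):


In [ ]:
#fetch all 10 abstracts with one request and split out every <summary> block
url = 'http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=10'
data = urllib.request.urlopen(url).read()
summaries = [chunk.split('</summary>', 1)[0] for chunk in str(data, 'utf-8').split('<summary>')[1:]]
for summary in summaries:
    print(dictionary.doc2bow(summary.lower().split()))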

Now build a class to stream the corpus, so that only one document needs to be held in memory at a time:


In [36]:
class MyCorpus(object):
    def _iter_(self):
        for articleN in range(0,5):
            url='http://export.arxiv.org/api/query?search_query=all:electron&start='+str(articleN)+'&max_results=1'
            data=urllib.request.urlopen(url).read()
            datasummary=str(data,'utf-8').split("<summary>",1)[1].split('</summary',1)[0]
            yield dictionary.doc2bow(datasummary.lower().split())

In [37]:
corpus_memory_friendly=MyCorpus()
print(corpus_memory_friendly)


<__main__.MyCorpus object at 0x113bbae48>

In [38]:
for vector in corpus_memory_friendly:
    print(vector)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-6efa9281a2e7> in <module>()
----> 1 for vector in corpus_memory_friendly:
      2     print(vector)

TypeError: 'MyCorpus' object is not iterable
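
The TypeError comes from the method name: the class above defines _iter_ with single underscores, while Python's iteration protocol looks for the special method __iter__ with double underscores, so the object is not recognized as iterable. Renaming the method fixes the streaming corpus (a sketch of the corrected class, using the same arXiv query as above):


In [ ]:
class MyCorpus(object):
    def __iter__(self):  #double underscores: this special method makes the class iterable
        for articleN in range(0, 5):
            url = 'http://export.arxiv.org/api/query?search_query=all:electron&start=' + str(articleN) + '&max_results=1'
            data = urllib.request.urlopen(url).read()
            datasummary = str(data, 'utf-8').split('<summary>', 1)[1].split('</summary>', 1)[0]
            yield dictionary.doc2bow(datasummary.lower().split())

corpus_memory_friendly = MyCorpus()
for vector in corpus_memory_friendly:  #each bag-of-words vector is fetched and built only when requested
    print(vector)

The same __iter__ pattern also works for streaming documents from a local file, one document per line: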

In [ ]:
class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'): 
            #assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
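
Once the corpus streams correctly, it can be serialized to disk so that later steps do not need to hit the network or re-read the raw text. A minimal sketch, assuming a mycorpus.txt file with one document per line exists:


In [ ]:
#stream the corpus once and store it on disk in Matrix Market format
corpus_memory_friendly = MyCorpus()
corpora.MmCorpus.serialize('mycorpus.mm', corpus_memory_friendly)
#the serialized corpus can itself be iterated over lazily later on
corpus_from_disk = corpora.MmCorpus('mycorpus.mm')
print(corpus_from_disk)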