Corpus
One of the easiest ways to get started with allofplos is to use the Corpus class.
But, you ask:
Why use the Corpus class?
It is a straightforward way to get Article objects back from your corpus without needing to instantiate them one by one.
It also has handy utilities for more specialized tasks, which we won't get into here.
How do I use it?
Eager, are we‽ I thought you'd never ask!
In [1]:
from allofplos import Corpus, starterdir
In [2]:
corpus = Corpus(starterdir)
In [3]:
len(corpus)
Out[3]:
In [4]:
display(corpus.random_article)
In [5]:
corpus['10.1371/journal.pcbi.1004141']
Out[5]:
In [6]:
display(corpus[0])
Or, you can access articles with a slice of integers, just as you would slice a list.
However, Articles can take up a lot of memory if you have (say) over 200,000 of them. To avoid that memory overhead, slicing returns a generator rather than a list.
Below, we display every other article among the first 10 in the corpus.
In [7]:
display(*(art for art in corpus[:10:2]))
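To make the list-versus-generator distinction concrete, here is a minimal sketch that does not use allofplos at all. ToyCorpus is a hypothetical stand-in (not the library's actual implementation), and the DOIs are made-up placeholders. It only mimics the behavior described above: an integer index returns one item, while a slice returns a generator.

```python
from types import GeneratorType


class ToyCorpus:
    """Hypothetical stand-in for Corpus; holds DOI strings instead of Article objects."""

    def __init__(self, dois):
        self.dois = list(dois)

    def __len__(self):
        return len(self.dois)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Yield items lazily so a large slice is never held in memory all at once.
            return (self.dois[i] for i in range(*key.indices(len(self.dois))))
        return self.dois[key]


# Placeholder DOIs, invented for this example only.
toy = ToyCorpus(f"10.1371/example.{n:04d}" for n in range(10))

every_other = toy[:10:2]
is_generator = isinstance(every_other, GeneratorType)
selected = list(every_other)   # consuming the generator materializes the items
leftover = list(every_other)   # a generator can only be consumed once
```

The practical consequence is the last line: once a sliced result has been iterated over, it is exhausted, so slice again (or wrap the slice in list() if it is small) when you need a second pass.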
In [8]:
for article in corpus:
    print("doi:", article.doi, "journal:", article.journal)
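Iterating like this is also a handy way to build small summaries in a single pass. The sketch below tallies articles per journal. It assumes nothing beyond the doi and journal attributes used in the loop above; ToyArticle and its values are hypothetical placeholders standing in for real Article objects.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for Article, carrying only the two attributes used in the loop.
ToyArticle = namedtuple("ToyArticle", ["doi", "journal"])

# Placeholder records, invented for this example only.
toy_corpus = [
    ToyArticle("10.1371/example.0001", "PLOS ONE"),
    ToyArticle("10.1371/example.0002", "PLOS Biology"),
    ToyArticle("10.1371/example.0003", "PLOS ONE"),
]

# One pass over the corpus; only the running tally is kept in memory.
journal_counts = Counter(article.journal for article in toy_corpus)
```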
You can use the corpus.random_sample() method to get a random sample of articles from the corpus.
The best way to use it is to iterate through the random sample: for article in corpus.random_sample(x)
NB: It returns a generator (not a list) to avoid using too much memory.
In [9]:
for article in corpus.random_sample(50):
    display(article)
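Because the sample is a generator, you can pull just a few items from it without loading the rest. The sketch below shows one way to do that with itertools.islice. Note that toy_random_sample is a hypothetical stand-in written for this example, not the method's actual implementation, and the DOIs are made-up placeholders.

```python
import random
from itertools import islice


def toy_random_sample(items, count, seed=None):
    """Hypothetical stand-in for corpus.random_sample: lazily yield `count` distinct items."""
    rng = random.Random(seed)
    # Sample lightweight indices up front; yield the items themselves one at a time.
    for index in rng.sample(range(len(items)), count):
        yield items[index]


# Placeholder DOIs, invented for this example only.
placeholder_dois = [f"10.1371/example.{n:04d}" for n in range(100)]

sample = toy_random_sample(placeholder_dois, 50, seed=0)
first_three = list(islice(sample, 3))   # take only a few items from the generator
remaining = list(sample)                # the other 47 are still waiting to be consumed
```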
Now you know the basics of using the Corpus class.
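Instantiate a corpus by pointing Corpus(directory) to a corpus directory on your file system.
Get the number of articles in the corpus with len(Corpus()).
Get a random article with Corpus().random_article.
Get a specific article by DOI with Corpus()[doi].
Get an article by index with Corpus()[0].
Get the first x articles in the corpus with Corpus()[:x].
Iterate through the articles with for article in Corpus():.
Get x random articles from the corpus with for article in Corpus().random_sample(x):.
Now it's time to check out the Article tutorial. Once it exists, we'll definitely link to it here.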