First, we import the two modules we need: csv to read the data file, and re to use regular expressions:
In [1]:
import csv
import re
Then we read the file and store the columns Scientific Name and Taxon Author in two lists:
In [2]:
with open('../data/bee_list.txt') as f:
    # the file is tab-delimited, with a header row naming the columns
    csvr = csv.DictReader(f, delimiter = '\t')
    species = []
    authors = []
    for r in csvr:
        species.append(r['Scientific Name'])
        authors.append(r['Taxon Author'])
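To see what csv.DictReader yields with a tab delimiter, here is a minimal sketch that reads from an in-memory string instead of the file. The two sample rows and the helper name read_columns are invented for illustration and are not taken from bee_list.txt; only the column names come from the cell above.

```python
import csv
import io

# Hypothetical tab-separated content; the rows are made up for illustration.
sample = (
    "Scientific Name\tTaxon Author\n"
    "Genus alpha\t(Author, 1900)\n"
    "Genus beta\tFirst & Second, 1950\n"
)

def read_columns(handle):
    """Return the two columns of interest as parallel lists."""
    reader = csv.DictReader(handle, delimiter='\t')
    species, authors = [], []
    for row in reader:
        # each row is a dict keyed by the header names
        species.append(row['Scientific Name'])
        authors.append(row['Taxon Author'])
    return species, authors

sample_species, sample_authors = read_columns(io.StringIO(sample))
```

Because both lists are filled in the same loop, they always have the same length, which is what the two len() calls check.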
How many species?
In [3]:
len(species)
Out[3]:
In [4]:
len(authors)
Out[4]:
Pick one element of authors to use for testing. Choose one that is quite complicated, such as the 38th element (index 37, because Python counts from 0):
In [5]:
au = authors[37]
In [6]:
au
Out[6]:
Now we need to build a regular expression. After some twiddling, you should end up with something like this, which captures the authors in one group, and the year in another group:
In [7]:
my_reg = re.compile(r'\(?([\w\s,\.\-\&]*),\s(\d{4})\)?')
# Translation
# \(?              -> open parenthesis (or not)
# ([\w\s,\.\-\&]*) -> the first group is the list of authors,
#                     which can contain \w (word character), \s (space),
#                     , (comma), \. (dot), \- (dash), \& (ampersand)
# ,\s              -> followed by comma and space
# (\d{4})          -> the second group is the year, 4 digits
# \)?              -> potentially, close parenthesis
Test the expression:
In [8]:
re.findall(my_reg,au)
Out[8]:
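As a further check, the same expression can be tried on a few strings that cover the main cases: parentheses, two authors joined by an ampersand, and three authors. This is only a sketch; the strings below are invented for illustration and do not come from the data file.

```python
import re

my_reg = re.compile(r'\(?([\w\s,\.\-\&]*),\s(\d{4})\)?')

# Hypothetical author strings, invented for illustration.
examples = [
    '(Author, 1900)',
    'First & Second, 1950',
    'First, Second & Third, 1975',
]

# findall returns a list of (authors, year) tuples, one per match
results = [my_reg.findall(s) for s in examples]
```

The first group is greedy, so it first swallows the whole string and then backtracks until the remainder is a comma, a space, and four digits. That is why commas between authors stay inside the first group.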
Now we write a function that uses the regular expression to extract the list of authors (useful when there are multiple authors) and the year:
In [9]:
def extract_list_au_year(au):
    # Note: re.match returns None if au does not fit the pattern
    tmp = re.match(my_reg, au)
    authorlist = tmp.group(1)
    year = tmp.group(2)
    # split authors into a list using re.split
    authorlist = re.split(r', | \& ', authorlist)
    # Translation: either separate using ', ' or ' & '
    return [authorlist, year]
Let's see the output of this function:
In [10]:
extract_list_au_year(au)
Out[10]:
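The function above assumes that every entry matches the pattern. If an entry did not match, re.match would return None and the call to group would raise an AttributeError. A minimal sketch of a more defensive variant follows; the name extract_list_au_year_safe and the test strings are hypothetical, and whether any entry in the data actually fails to match is not something established here.

```python
import re

my_reg = re.compile(r'\(?([\w\s,\.\-\&]*),\s(\d{4})\)?')

def extract_list_au_year_safe(au):
    """Hypothetical variant: return None when the pattern does not match."""
    tmp = my_reg.match(au)
    if tmp is None:
        return None
    # split the first group on ', ' or ' & ' to get individual authors
    authorlist = re.split(r', | & ', tmp.group(1))
    return [authorlist, tmp.group(2)]

# Invented strings, for illustration only.
parsed = extract_list_au_year_safe('First, Second & Third, 1975')
unparsed = extract_list_au_year_safe('no year here')
```

Returning None lets the caller decide whether to skip or report an entry that cannot be parsed.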
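Finally, let's build two dictionaries: one counting the number of entries for each author, and one counting the number of entries for each year: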
In [11]:
dict_years = {}
dict_authors = {}
for au in authors:
    tmp = extract_list_au_year(au)
    # tmp[0] is the list of authors: count each of them
    for aunum in tmp[0]:
        if aunum in dict_authors.keys():
            dict_authors[aunum] = dict_authors[aunum] + 1
        else:
            dict_authors[aunum] = 1
    # tmp[1] is the year: count it once per entry
    if tmp[1] in dict_years.keys():
        dict_years[tmp[1]] = dict_years[tmp[1]] + 1
    else:
        dict_years[tmp[1]] = 1
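The same tallies can also be kept with collections.Counter from the standard library, which treats a missing key as a count of zero and so removes the if/else branches. This is an alternative sketch, not the approach used in the cell above; the records below are invented and simply mimic the [authorlist, year] shape returned by extract_list_au_year.

```python
from collections import Counter

# Hypothetical parsed records in the [authorlist, year] shape; made up for illustration.
records = [
    [['First', 'Second'], '1950'],
    [['First'], '1900'],
    [['Third'], '1950'],
]

count_authors = Counter()
count_years = Counter()
for authorlist, year in records:
    count_authors.update(authorlist)  # adds 1 for each author in the list
    count_years[year] += 1            # missing keys start at 0
```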
For example, these are all the authors:
In [12]:
dict_authors
Out[12]:
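To find the author with the most entries, we use the following strategy: find the largest count among the values, locate its position, and look up the key at that position: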
In [13]:
max_value_author = max(dict_authors.values())
max_value_author
Out[13]:
In [14]:
which_index = list(dict_authors.values()).index(max_value_author)
which_index
Out[14]:
And the winner is:
In [15]:
list(dict_authors.keys())[which_index]
Out[15]:
We use the same strategy to find that the golden year of bee publication is:
In [16]:
max_value_year = max(dict_years.values())
which_index = list(dict_years.values()).index(max_value_year)
list(dict_years.keys())[which_index]
Out[16]:
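The three-step lookup used above (maximum value, its index, the key at that index) can also be written in one step by passing the dictionary's get method as the key argument of max. The sketch below uses invented counts, not the results from the bee data.

```python
# Hypothetical counts, invented for illustration.
counts = {'First': 2, 'Second': 1, 'Third': 1}

# max iterates over the keys and compares them by their associated count
top_key = max(counts, key=counts.get)
top_count = counts[top_key]
```

If several keys share the largest count, both approaches return the first one in the dictionary's order.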