Web snapshots of Google Scholar searches were saved to a Zotero Group Library. Each snapshot is a single page of search results.
From each saved web snapshot we extract, for every result: the title, the first author, the publication year, and the citation count.
In [1]:
import os
from os.path import join, basename, splitext
import re
from glob import glob
from zipfile import ZipFile
from tqdm import tqdm
from pyzotero import zotero
from bs4 import BeautifulSoup
import pandas as pd
from lib.secrets import WEB_SNAPSHOTS, USER_KEY
In [2]:
output_dir = join('data', 'attachments')
We connect to the shared Zotero group library that holds the web snapshots of the Google Scholar searches.
In [3]:
zot = zotero.Zotero(WEB_SNAPSHOTS, 'group', USER_KEY)
Given the Google Scholar searches saved in a Zotero group library, we want the saved Google Scholar web pages from each search. In this case the searches were saved in a group library called Web Snapshots. Within that library there are several collections; each collection holds the Google Scholar pages returned by a particular search term such as "biodiversity database". A search term may return multiple pages, and there is one attachment in the collection for each page: one attachment for the first page of the "biodiversity database" results (results 1-10), another for the second page (results 11-20), and so on.
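As a quick way to see that structure (a sketch, not part of the original analysis; it reuses the zot client created above), we can list each search term and how many result pages were saved for it:

# Sketch: print each collection (search term) and its number of saved pages
for collection in zot.collections():
    items = zot.everything(zot.collection_items(collection['key']))
    pages = [i for i in items if i['data']['itemType'] == 'attachment']
    print(collection['data']['name'], len(pages))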
Here we:
1) Loop through every search term (or collection).
2) For every search term, loop through every page (or attachment) returned by that search.
3) Download each attachment, which is a zip file containing the data we need for this notebook. The next section extracts what we need from it.
In [4]:
all_attachments = []
for collection in zot.collections():
    # Keep only the attachment items (the saved web snapshots) in this collection
    attachments = [a for a
                   in zot.everything(zot.collection_items(collection['key']))
                   if a['data']['itemType'] == 'attachment']
    for attachment in tqdm(attachments, desc='attachments'):
        all_attachments.append(attachment)
        # Save the attachment locally, named after its Zotero item key
        zot.dump(attachment['key'], '{}'.format(attachment['key']), output_dir)
In [5]:
pattern = join(output_dir, '*.zip')
target = 'scholar.html'
src = join(output_dir, target)
zip_files = glob(pattern)
for zip_file in tqdm(zip_files, desc='zip files'):
    with ZipFile(zip_file) as zippy:
        name_list = zippy.namelist()
        # Skip snapshots that do not contain the saved results page
        if target not in name_list:
            continue
        zippy.extract(target, output_dir)
    # Rename the extracted scholar.html after its source zip (the Zotero item key)
    base_name = basename(zip_file)
    file_name = splitext(base_name)[0]
    dst = join(output_dir, '{}.html'.format(file_name))
    os.rename(src, dst)
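As a quick check (a sketch, assuming every snapshot zip contains a scholar.html page), we can compare how many zips were downloaded with how many pages were extracted:

# Quick check: compare downloaded zips with extracted HTML pages
print(len(glob(join(output_dir, '*.zip'))), 'zip files')
print(len(glob(join(output_dir, '*.html'))), 'html files')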
In [6]:
pattern = join(output_dir, '*.html')
html_files = glob(pattern)
all_docs = []
for html_file in tqdm(html_files, desc='html files'):
    with open(html_file) as in_file:
        page = in_file.read()
    soup = BeautifulSoup(page, 'html.parser')
    base_name = basename(html_file)
    key = splitext(base_name)[0]
    # Each Google Scholar result sits in a div with class "gs_r"
    for result in soup.select('div.gs_r'):
        # Title
        title_obj = result.select_one('.gs_rt a')
        if not title_obj:
            continue
        title = title_obj.get_text()
        # The byline (.gs_a) contains several fields which are extracted below
        author_string = result.select_one('.gs_a').get_text()
        # First author
        authors = author_string.split('-')[0]
        first_author = authors.split(',')[0].strip()
        # Publication year
        match = re.search(r'\d{4}', author_string)
        publication_year = match.group(0) if match else ''
        # Citations
        citation_string = result.find(text=re.compile(r'Cited by \d+'))
        citations = 0
        if citation_string:
            match = re.search(r'\d+', citation_string)
            citations = int(match.group(0)) if match else 0
        all_docs.append({
            'key': key,
            'title': title,
            'first_author': first_author,
            'publication_year': publication_year,
            'citations': citations})
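To make the byline parsing above concrete, here is the same split logic applied to a made-up .gs_a string (the byline text is hypothetical; real bylines vary and some lack a year):

# Hypothetical byline in the usual "authors - venue, year - site" shape
byline = 'J Smith, A Jones - Journal of Examples, 2015 - example.com'
authors = byline.split('-')[0]                # 'J Smith, A Jones '
first_author = authors.split(',')[0].strip()  # 'J Smith'
match = re.search(r'\d{4}', byline)
publication_year = match.group(0) if match else ''
print(first_author, publication_year)         # J Smith 2015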
In [7]:
csv_path = join(output_dir, 'citations_v3.csv')
df = pd.DataFrame(all_docs)
df.to_csv(csv_path, index=False)
df.shape
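A quick way to verify the output (a sketch; it simply reads the file just written back in) is to load the CSV and preview the first few rows:

# Read the CSV back and preview the extracted records
check = pd.read_csv(csv_path)
print(check.shape)
check.head()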