Extract Information from the Web Snapshots

Web snapshots of Google Scholar searches were saved to a Zotero Group Library. Each snapshot is a single page of the search results.

From each of the saved web snapshots extract the following:

  • Title
  • First author
  • Number of citations
  • Zotero key
  • Publication year
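
Each search result becomes one record with those fields. As a rough sketch of the target structure, a single record could look like the dictionary below (the values are made up for illustration; the field names match the extraction code later in this notebook):

In [ ]:
# Hypothetical example of one extracted record (values are illustrative only)
example_record = {
    'key': 'ABCD1234',                          # Zotero attachment key
    'title': 'A global biodiversity database',  # title of the search result
    'first_author': 'J Smith',
    'publication_year': '2015',
    'citations': 42,
}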

In [1]:
import os
from os.path import join, basename, splitext
import re
from glob import glob
from zipfile import ZipFile

from tqdm import tqdm
from pyzotero import zotero
from bs4 import BeautifulSoup
import pandas as pd

from lib.secrets import WEB_SNAPSHOTS, USER_KEY

In [2]:
output_dir = join('data', 'attachments')
os.makedirs(output_dir, exist_ok=True)  # make sure the work directory exists

Connect to Zotero

We connect to the shared Zotero group library that holds the web snapshots of the Google Scholar searches.


In [3]:
zot = zotero.Zotero(WEB_SNAPSHOTS, 'group', USER_KEY)

Loop through Zotero Library and Save the Web Pages to a Work Directory (Zip File)

Given a collection of Google Scholar searches in a Zotero group library, we want to get the saved Google Scholar web page from each search. In this case the Google Scholar searches were saved in a group library called Web Snapshots. Within that library there are several collections. Each collection represents the Google Scholar web pages returned by a particular search term like "biodiversity database". Each search term may return multiple pages, and there is one attachment in the collection for each web page. So there will be one attachment for the first page returned by the "biodiversity database" search (results 1-10), another attachment for the second page (results 11-20), and so on.
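
Before downloading anything, it can help to confirm that structure. Here is a minimal sketch (assuming the connection created above) that lists each collection (search term) and how many attachments (result pages) it holds:

In [ ]:
# Sketch: one line per search term (collection) with its number of saved result pages
for collection in zot.collections():
    items = zot.everything(zot.collection_items(collection['key']))
    pages = [i for i in items if i['data']['itemType'] == 'attachment']
    print(collection['data']['name'], len(pages))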

Here we:

1) Loop through every search term (or collection).

2) For every search term we loop through every page (or attachment) returned by that search term.

3) We download the attachment, which is a zip file containing more data than we need for this notebook, so in the next section we will extract just what we need.


In [4]:
all_attachments = []

# Each collection holds one search term; each attachment is one page of its results
for collection in zot.collections():
    attachments = [a for a
                   in zot.everything(zot.collection_items(collection['key']))
                   if a['data']['itemType'] == 'attachment']
    for attachment in tqdm(attachments, desc='attachments'):
        all_attachments.append(attachment)
        # Save the snapshot zip, named after its Zotero attachment key
        zot.dump(attachment['key'], '{}.zip'.format(attachment['key']), output_dir)

Extract the HTML Web Pages from the Zip Files

The zip files from the previous step contain more data than we need. We want to extract just the web page from the search. Here we loop through all of the zip files and extract the search web page.
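
To see what a single snapshot archive actually contains, you can peek at its file listing first. This is a quick sketch and assumes at least one zip file was downloaded in the previous step:

In [ ]:
# Sketch: inspect the contents of one downloaded snapshot archive
sample_zip = glob(join(output_dir, '*.zip'))[0]
with ZipFile(sample_zip) as zippy:
    print(zippy.namelist())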


In [5]:
pattern = join(output_dir, '*.zip')
target = 'scholar.html'          # the saved Google Scholar results page inside each archive
src = join(output_dir, target)   # where ZipFile.extract() places it before it is renamed

zip_files = glob(pattern)
for zip_file in tqdm(zip_files, desc='zip files'):

    with ZipFile(zip_file) as zippy:
        name_list = zippy.namelist()

        if target not in name_list:
            continue

        zippy.extract(target, output_dir)

        # Name the extracted page after the Zotero attachment key, e.g. <key>.html
        base_name = basename(zip_file)
        file_name = splitext(base_name)[0]
        dst = join(output_dir, '{}.html'.format(file_name))

        os.rename(src, dst)


zip files: 100%|██████████| 183/183 [00:00<00:00, 867.39it/s]

Extract Data from HTML Pages

Now we can extract the data from the saved web pages. We're using a Python library called BeautifulSoup4 for this.

Fields extracted from each search result:

  • Title
  • First author
  • Publication year
  • Citations
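
To make the selectors in the next cell easier to follow, here is a minimal sketch run against a made-up fragment of a result page. The HTML below is invented for illustration; only the class names match what the saved pages use:

In [ ]:
# Sketch: how the CSS selectors map to the fields, using a made-up result fragment
snippet = '''
<div class="gs_r">
  <h3 class="gs_rt"><a href="#">A global biodiversity database</a></h3>
  <div class="gs_a">J Smith, A Jones - Ecology, 2015 - example.org</div>
  <div><a href="#">Cited by 42</a></div>
</div>
'''
demo = BeautifulSoup(snippet, 'html.parser')
result = demo.select_one('div.gs_r')
print(result.select_one('.gs_rt a').get_text())         # title
print(result.select_one('.gs_a').get_text())            # author / venue / year line
print(result.find(string=re.compile(r'Cited by \d+')))  # citation string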

In [6]:
pattern = join(output_dir, '*.html')
html_files = glob(pattern)

all_docs = []

for html_file in tqdm(html_files, desc='html files'):

    with open(html_file, encoding='utf-8') as in_file:
        page = in_file.read()

    soup = BeautifulSoup(page, 'html.parser')

    # The HTML file name is the Zotero attachment key
    base_name = basename(html_file)
    key = splitext(base_name)[0]

    # Each search result sits in its own div with the class "gs_r"
    for result in soup.select('div.gs_r'):

        # Title
        title_obj = result.select_one('.gs_rt a')
        if not title_obj:
            continue

        title = title_obj.get_text()

        # This contains several fields which are extracted later
        author_string = result.select_one('.gs_a').get_text()

        # First author
        authors = author_string.split('-')[0]
        first_author = authors.split(',')[0].strip()

        # Publication year
        match = re.search(r'\d{4}', author_string)
        publication_year = match.group(0) if match else ''

        # Citations
        citation_string = result.find(string=re.compile(r'Cited by \d+'))
        citations = 0
        if citation_string:
            match = re.search(r'\d+', citation_string)
            citations = int(match.group(0)) if match else 0

        all_docs.append({
            'key': key,
            'title': title,
            'first_author': first_author,
            'publication_year': publication_year,
            'citations': citations})


html files: 100%|██████████| 183/183 [00:18<00:00,  9.30it/s]

Write Results to a CSV File


In [7]:
csv_path = join(output_dir, 'citations_v3.csv')

df = pd.DataFrame(all_docs)
df.shape
df.to_csv(csv_path, index=False)
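
As a quick sanity check (a sketch, not part of the original run), the CSV can be read back to confirm the row count and columns:

In [ ]:
# Sketch: read the CSV back to confirm it was written as expected
check = pd.read_csv(csv_path)
print(check.shape)
print(check.columns.tolist())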
