This notebook should be run using the github/mdtok container on DockerHub. The Dockerfile that defines this container is located at the root of this repository and is named cpu.Dockerfile.
Using this container ensures that you can run this notebook properly, since many of the dependencies in this project change rapidly. To run this notebook with this container, the commands are:
Get the container: docker pull github/mdtok
Run the container: docker run -it --net=host -v <host_dir>:/ds github/mdtok bash
In [1]:
from mdparse.parser import transform_pre_rules, compose
import pandas as pd
from tqdm import tqdm_notebook
from fastai.text.transform import defaults
The GHArchive project ingests large amounts of data from GitHub repositories. This data is stored in BigQuery for public consumption.
For this project, we gathered over 18 million GitHub issues by executing this query. The query attempts to remove duplicate issues where the content is roughly the same.
The results of this query are split into 100 csv files available for free download from the following Google Cloud Storage bucket:
https://storage.googleapis.com/issue_label_bot/language_model_data/0000000000{00-99}.csv.gz
Each file contains approximately 180,000 issues and is 55MB compressed.
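As a minimal sketch (assuming only pandas and network access to the public bucket), the 100 zero-padded file names can be constructed and one of them read directly like this:

import pandas as pd

# The files are named 000000000000.csv.gz through 000000000099.csv.gz (zero-padded to 12 digits).
base = 'https://storage.googleapis.com/issue_label_bot/language_model_data/'
urls = [f'{base}{str(i).zfill(12)}.csv.gz' for i in range(100)]

# Read the first chunk straight from Google Cloud Storage.
sample_df = pd.read_csv(urls[0])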
In [35]:
df = pd.read_csv('https://storage.googleapis.com/issue_label_bot/language_model_data/000000000000.csv.gz').sample(5)
df.head(1)
Out[35]:
mdparse
mdparse is a library that parses markdown text and annotates it with meta-data useful for deep learning. Below is an illustration of mdparse at work; the parsed and annotated text can be seen in the clean_body field.
The changes are often subtle, but they can make a big difference for feature extraction in language modeling.
In [43]:
pd.set_option('max_colwidth', 1000)

df['clean_body'] = ''
for i, b in tqdm_notebook(enumerate(df.body), total=len(df)):
    try:
        df['clean_body'].iloc[i] = compose(transform_pre_rules + defaults.text_pre_rules)(b)
    except:
        print(f'error at: {i}')
        break

df[['body', 'clean_body']]
Out[43]:
In [2]:
from fastai.text.transform import ProcessPoolExecutor, partition_by_cores
import numpy as np
from fastai.core import parallel
from itertools import chain
In [3]:
transforms = transform_pre_rules + defaults.text_pre_rules
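As a quick sanity check, the composed transform can be applied to a small markdown snippet. This is a hedged example; the snippet is made up, and the exact output depends on the versions of mdparse and fastai installed:

t = compose(transforms)

# Hypothetical markdown snippet; the output contains the special tokens
# inserted by transform_pre_rules and fastai's text_pre_rules.
sample_md = "# Bug report\n\nSee `foo.py` and [the docs](https://example.com)."
print(t(sample_md))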
In [4]:
def process_dict(dfdict, _):
    """Process one row of data, but allow failure."""
    t = compose(transforms)
    title = dfdict['title']
    body = dfdict['body']
    try:
        text = 'xxxfldtitle ' + t(title) + ' xxxfldbody ' + t(body)
    except:
        return None
    return {'url': dfdict['url'], 'text': text}


def download_data(i, _):
    """Since the data is already in 100 chunks, process it chunk by chunk."""
    fn = f'https://storage.googleapis.com/issue_label_bot/language_model_data/{str(i).zfill(12)}.csv.gz'
    dicts = [process_dict(d, 0) for d in pd.read_csv(fn).to_dict(orient='records')]
    df = pd.DataFrame([d for d in dicts if d])
    df.to_csv(f'/ds/IssuesLanguageModel/data/1_processed_csv/processed_part{str(i).zfill(4)}.csv', index=False)
    return df
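To illustrate, here is a hedged sketch of what process_dict returns for a single hypothetical issue record (the field names url, title, and body match those used above; the values are invented):

# Hypothetical issue record for illustration only.
sample = {'url': 'https://github.com/some_org/some_repo/issues/1',
          'title': 'Fix crash on startup',
          'body': 'The app crashes when **launching**. See `main.py`.'}

# Returns a dict with the url and the concatenated, transformed title/body,
# or None if any transform raises an exception.
process_dict(sample, 0)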
Note: the procedure below took over 30 hours on an AWS p3.8xlarge instance with 32 cores and 64GB of memory. You may need to adjust the number of workers based on your memory and compute constraints.
In [5]:
dfs = parallel(download_data, list(range(100)), max_workers=31)
In [13]:
dfs_rows = sum([x.shape[0] for x in dfs])
print(f'number of rows in pre-processed data: {dfs_rows:,}')
In [14]:
del dfs
Set aside 10 random files (out of 100) as the validation set
In [15]:
from pathlib import Path
from random import shuffle
# shuffle the files
p = Path('/ds/IssuesLanguageModel/data/1_processed_csv/')
files = p.ls()
shuffle(files)
# show a preview of files
files[:5]
Out[15]:
In [16]:
valid_df = pd.concat([pd.read_csv(f) for f in files[:10]]).dropna().drop_duplicates()
train_df = pd.concat([pd.read_csv(f) for f in files[10:]]).dropna().drop_duplicates()
print(f'rows in train_df: {train_df.shape[0]:,}')
print(f'rows in valid_df: {valid_df.shape[0]:,}')
In [17]:
valid_df.to_pickle('/ds/IssuesLanguageModel/data/2_partitioned_df/valid_df.pkl')
train_df.to_pickle('/ds/IssuesLanguageModel/data/2_partitioned_df/train_df.pkl')
You can download the above saved dataframes (in pickle format) from Google Cloud Storage:
train_df.pkl (9GB):
https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/train_df.pkl
valid_df.pkl (1GB):
https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/valid_df.pkl
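As a sketch (assuming pandas plus enough disk space and memory for the 9GB train file), the pickled dataframes can be fetched and loaded like this:

import urllib.request
import pandas as pd

base = 'https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/'

# Download each pickle locally, then load it with pandas.
for name in ['train_df.pkl', 'valid_df.pkl']:
    urllib.request.urlretrieve(base + name, name)

train_df = pd.read_pickle('train_df.pkl')
valid_df = pd.read_pickle('valid_df.pkl')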