Running This Notebook

This notebook should be run using the github/mdtok container on DockerHub. The Dockerfile that defines this container is located at the root of this repository and is named cpu.Dockerfile.

Using this container ensures that you can run this notebook properly, since many of the dependencies in this project change rapidly. To run this notebook using this container, the commands are:

Get the container: docker pull github/mdtok

Run the container: docker run -it --net=host -v <host_dir>:/ds github/mdtok bash


In [1]:
from mdparse.parser import transform_pre_rules, compose
import pandas as pd
from tqdm import tqdm_notebook
from fastai.text.transform import defaults

Source of Data

The GHArchive project ingests large amounts of data from GitHub repositories. This data is stored in BigQuery for public consumption.

For this project, we gathered over 18 million GitHub issues by executing this query. The query attempts to remove duplicate issues whose content is roughly the same.

The results of this query are split into 100 CSV files for free download from the following Google Cloud Storage bucket:

https://storage.googleapis.com/issue_label_bot/language_model_data/0000000000{00-99}.csv.gz

Each file contains approximately 180,000 issues and is 55MB compressed.
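The shard file names follow a simple zero-padded pattern; a minimal sketch of constructing the download URLs (the helper name `shard_url` is ours, but the 12-digit padding matches the bucket's file names):

```python
# Build the URLs for the 100 CSV shards hosted in the public bucket.
# Each file name is the shard index zero-padded to 12 digits.
BASE = 'https://storage.googleapis.com/issue_label_bot/language_model_data'

def shard_url(i: int) -> str:
    """Return the download URL for shard i (0-99)."""
    return f'{BASE}/{str(i).zfill(12)}.csv.gz'

urls = [shard_url(i) for i in range(100)]
print(urls[0])   # ends with 000000000000.csv.gz
print(urls[99])  # ends with 000000000099.csv.gz
```

Each shard can then be read directly with pd.read_csv(url), as done below.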

Preview Data

Download Sample

The dataframe below illustrates the format of the raw data:


In [35]:
df = pd.read_csv('https://storage.googleapis.com/issue_label_bot/language_model_data/000000000000.csv.gz').sample(5)

df.head(1)


Out[35]:
url repo title title_length body body_length
4354 https://github.com/codegangsta/cli/issues/405 codegangsta/cli Help does not display command aliases 39 Hello, xxxlnbrk First I would like to say that your project is awesome and it is the best CLI library I have used! Keep up the good work! Thank you for that. xxxlnbrk I have a question related to latest version. Recently I have noticed that when displaying help it does no longer show aliases next to the command names. xxxlnbrk When command is defined in the following way, no aliases are shown in the help output xxxlnbrk `{Name: xxxdblqtenginx-startxxxdblqte, xxxlnbrk Aliases: []string{xxxdblqtengstartxxxdblqte, xxxdblqtenstartxxxdblqte}, xxxlnbrk Category: cat(NginxCategory),}` xxxlnbrk When deprecated ShortName is used instead Aliases, help does display alias properly. xxxlnbrk `{ xxxlnbrk Name: xxxdblqtenginx-startxxxdblqte, xxxlnbrk ShortName: xxxdblqtengstartxxxdblqte, xxxlnbrk Category: cat(NginxCategory), xxxlnbrk ... xxxlnbrk }` xxxlnbrk Is this change in behaviour intentional and we should define a custom help template or did it simply slipped during refactoring? 945

Illustrate Markdown Parsing Using mdparse

mdparse is a library that parses markdown text and annotates it with meta-data fields for deep learning. Below is an illustration of mdparse at work; the parsed and annotated text can be seen in the clean_body field:

The changes are often subtle, but they can make a big difference for feature extraction in language modeling.


In [43]:
pd.set_option('max_colwidth', 1000)

df['clean_body'] = ''
for i, b in tqdm_notebook(enumerate(df.body), total=len(df)):
    try:
        df['clean_body'].iloc[i] = compose(transform_pre_rules+defaults.text_pre_rules)(b)
    except Exception:
        print(f'error at: {i}')
        break
        
df[['body', 'clean_body']]


/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py:190: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

Out[43]:
body clean_body
4354 Hello, xxxlnbrk First I would like to say that your project is awesome and it is the best CLI library I have used! Keep up the good work! Thank you for that. xxxlnbrk I have a question related to latest version. Recently I have noticed that when displaying help it does no longer show aliases next to the command names. xxxlnbrk When command is defined in the following way, no aliases are shown in the help output xxxlnbrk `{Name: xxxdblqtenginx-startxxxdblqte, xxxlnbrk Aliases: []string{xxxdblqtengstartxxxdblqte, xxxdblqtenstartxxxdblqte}, xxxlnbrk Category: cat(NginxCategory),}` xxxlnbrk When deprecated ShortName is used instead Aliases, help does display alias properly. xxxlnbrk `{ xxxlnbrk Name: xxxdblqtenginx-startxxxdblqte, xxxlnbrk ShortName: xxxdblqtengstartxxxdblqte, xxxlnbrk Category: cat(NginxCategory), xxxlnbrk ... xxxlnbrk }` xxxlnbrk Is this change in behaviour intentional and we should define a custom help template or did it simply slipped du... Hello, First I would like to say that your project is awesome and it is the best CLI library I have used! Keep up the good work! Thank you for that. I have a question related to latest version. Recently I have noticed that when displaying help it does no longer show aliases next to the command names. \n When command is defined in the following way, no aliases are shown in the help output xxxcdb xxxjson xxxcde \n When deprecated ShortName is used instead Aliases, help does display alias properly. xxxcdb xxxjson xxxcde \n Is this change in behaviour intentional and we should define a custom help template or did it simply slipped during refactoring?
52889 Like height, custom colors (e.g. adding alpha), vibration length and custom layouts, because not including <> makes this keyboard rather not THAT useful for me. Like height, custom colors (e.g. adding alpha), vibration length and custom layouts, because not including <> makes this keyboard rather not THAT useful for me.
151496 As a user, I can see a list of all the Open tasks (created by security guard) so I can take select any of those tasks to complete it. xxxlnbrk xxxlnbrk I should see a list of entries made by Security guard with trailer's location as 'Open.' xxxlnbrk xxxlnbrk As a user I can see following details on my dashboard: xxxlnbrk xxxlnbrk - Trailer Number xxxlnbrk - Status xxxlnbrk - Trip Number xxxlnbrk - Shunt Pending (Yes/No) xxxlnbrk - Confirm Shunt (Yes/No) xxxlnbrk - Zone = Open xxxlnbrk - Notes xxxlnbrk - Seal Number xxxlnbrk xxxlnbrk # Acceptance Tests xxxlnbrk xxxlnbrk * Verify that a shunt driver user is getting following options: xxxlnbrk >* Assign zone xxxlnbrk >* Raise conflict xxxlnbrk xxxlnbrk * Verify that a Shunt Driver can sort the list using any of the below mentioned parameters: xxxlnbrk xxxlnbrk * Verify that the entries made by Security User are appearing under Shunt Driver's dashboard with Zone as open. xxxlnbrk xxxlnbrk * Verify that entries with Zone as Op... As a user, I can see a list of all the Open tasks (created by security guard) so I can take select any of those tasks to complete it. \n I should see a list of entries made by Security guard with trailer's location as 'Open.' \n As a user I can see following details on my dashboard: \n xxxlistB Trailer Number Status Trip Number Shunt Pending (Yes / No) Confirm Shunt (Yes / No) Zone = Open Notes Seal Number xxxlistE \n xxxhl Acceptance Tests \n xxxlistB Verify that a shunt driver user is getting following options: \n Assign zone Raise conflict Verify that a Shunt Driver can sort the list using any of the below mentioned parameters: \n Verify that the entries made by Security User are appearing under Shunt Driver's dashboard with Zone as open. \n Verify that entries with Zone as Open will get displayed under separate list \n xxxlistE
168828 When rendering a simple markdown with latex equation as html xxxlnbrk Simple reproducible example: xxxlnbrk `$$\\hat{f}(x) = \\sum\\limits_{i=1}^N \\hat{\\alpha_i}\\; y_i\\; K(x,x_i) + \\hat{\\beta_i}$$` xxxlnbrk VS-code produces: xxxlnbrk ```html xxxlnbrk <script type=xxxdblqtetext/x-mathjax-configxxxdblqte> xxxlnbrk MathJax.Hub.Config({xxxdblqteextensionsxxxdblqte:[xxxdblqtetex2jax.jsxxxdblqte],xxxdblqtejaxxxxdblqte:[xxxdblqteinput/TeXxxxdblqte,xxxdblqteoutput/HTML-CSSxxxdblqte],xxxdblqtemessageStylexxxdblqte:xxxdblqtenonexxxdblqte,xxxdblqtetex2jaxxxxdblqte:{xxxdblqteprocessEnvironmentsxxxdblqte:false,xxxdblqteprocessEscapesxxxdblqte:true,xxxdblqteinlineMathxxxdblqte:[[xxxdblqte$xxxdblqte,xxxdblqte$xxxdblqte],[xxxdblqte\\\\(xxxdblqte,xxxdblqte\\\\)xxxdblqte]],xxxdblqtedisplayMathxxxdblqte:[[xxxdblqte$$xxxdblqte,xxxdblqte$$xxxdblqte],[xxxdblqte\\\\[xxxdblqte,xxxdblqte\\\\]xxxdblqte]]},xxxdblqteTeXxxxdblqte:{xxxdblqteextensionsxxxdblqte:[xxxdblqteAMSmath.jsxxxdblqte,xxxd... When rendering a simple markdown with latex equation as html Simple reproducible example: xxxcdb $$ \ \ hat xxxjson (x) = \ \ sum \ \ limits_ xxxjson ^N \ \ hat xxxjson \ \ ; y_i \ \ ; K(x,x_i) + \ \ hat xxxjson $$ xxxcde \n VS-code produces: \n xxxcdb lang-html xxxhtml xxxhtml xxxhtml < / script> \n xxxcde \n The last particular statement: \n xxxcdb " root " : " file: xxxfilepath " xxxcde \n Makes it not rendering correctly in Firefox. It works in chrome. VS-code works in both browsers.
40010 ### Versions and Environment xxxlnbrk **Vuetify:** 1.2.1 xxxlnbrk **Vue:** 2.5.17 xxxlnbrk **Browsers:** Chrome 68.0.3440.106 xxxlnbrk **OS:** Mac OS 10.12.6 xxxlnbrk ### Steps to reproduce xxxlnbrk Position the mouse pointer exactly between two checkboxes and click, or try checking a checkbox by clicking in the lower half of a checkbox. xxxlnbrk ### Expected Behavior xxxlnbrk One would expect the checkbox to be checked where the mouse pointer is, i.e the tip of the arrow or the end of the index finger if the icon is a hand. This also happens in HTML checkboxes. xxxlnbrk ### Actual Behavior xxxlnbrk The checkbox to the bottom of the mouse pointer is selected. Either because the checkbox seems to correspond to the center of the mouse index, or the surface area of the checkbox is too large and overlaps other check box areas since it does not scale when heigh values lower. This also happens on other low value heights. I suspect the latter one to be the issue, as checkboxes can also ... xxxhm Versions and Environment \n Vuetify: 1.2.1 Vue: 2.5.17 Browsers: Chrome xxunk OS: Mac OS 10.12.6 \n xxxhm Steps to reproduce \n Position the mouse pointer exactly between two checkboxes and click, or try checking a checkbox by clicking in the lower half of a checkbox. \n xxxhm Expected Behavior \n One would expect the checkbox to be checked where the mouse pointer is, i.e the tip of the arrow or the end of the index finger if the icon is a hand. This also happens in HTML checkboxes. \n xxxhm Actual Behavior \n The checkbox to the bottom of the mouse pointer is selected. Either because the checkbox seems to correspond to the center of the mouse index, or the surface area of the checkbox is too large and overlaps other check box areas since it does not scale when heigh values lower. This also happens on other low value heights. I suspect the latter one to be the issue, as checkboxes can also be checked when clicking slightly outside of the checkbox. 
\n xxxhm Reproduction Link \...
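The raw export encodes characters that would break the CSV format with placeholder tokens, e.g. xxxlnbrk for line breaks and xxxdblqte for double quotes (both visible in the body column above). A minimal sketch of reversing these substitutions, assuming just the two mappings evident from the sample rows:

```python
# Hypothetical reverse-mapping of the placeholder tokens seen in the raw
# export; only the two mappings evident from the sample rows are included.
PLACEHOLDERS = {
    ' xxxlnbrk ': '\n',   # line break (tokens appear space-separated)
    'xxxdblqte': '"',     # double quote
}

def restore_body(text: str) -> str:
    """Replace placeholder tokens with the characters they stand for."""
    for token, char in PLACEHOLDERS.items():
        text = text.replace(token, char)
    return text

print(restore_body('Hello, xxxlnbrk a xxxdblqtequotedxxxdblqte word'))
```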

Download And Pre-Process Data

We download the data from GCP and pre-process this data before saving to disk.


In [2]:
from fastai.text.transform import ProcessPoolExecutor, partition_by_cores
import numpy as np
from fastai.core import parallel
from itertools import chain

In [3]:
transforms = transform_pre_rules + defaults.text_pre_rules

In [4]:
def process_dict(dfdict, _):
    """process the data, but allow failure."""
    t = compose(transforms)
    title = dfdict['title']
    body = dfdict['body']
    try:
        text = 'xxxfldtitle '+ t(title) + ' xxxfldbody ' + t(body)
    except Exception:
        return None
    return {'url': dfdict['url'], 'text':text}


def download_data(i, _):
    """Since the data is in 100 chunks already, just do the processing by chunk."""
    fn = f'https://storage.googleapis.com/issue_label_bot/language_model_data/{str(i).zfill(12)}.csv.gz'
    dicts = [process_dict(d, 0) for d in pd.read_csv(fn).to_dict(orient='records')]
    df = pd.DataFrame([d for d in dicts if d])
    df.to_csv(f'/ds/IssuesLanguageModel/data/1_processed_csv/processed_part{str(i).zfill(4)}.csv', index=False)
    return df
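To illustrate the shape of the data flowing through process_dict without pulling in mdparse or fastai, here is a self-contained toy version: compose chains a list of transforms into one function, and a trivial lowercasing rule stands in for the real pre-processing rules (both the toy rules and the sample row are hypothetical):

```python
def compose(funcs):
    """Chain one-argument transforms into a single function
    (a stand-in for the compose imported from mdparse.parser)."""
    def composed(x):
        for f in funcs:
            x = f(x)
        return x
    return composed

# Trivial stand-ins for transform_pre_rules + defaults.text_pre_rules.
toy_transforms = [str.strip, str.lower]

def toy_process_dict(dfdict):
    """Mirror process_dict: concatenate field-marked title and body."""
    t = compose(toy_transforms)
    try:
        text = 'xxxfldtitle ' + t(dfdict['title']) + ' xxxfldbody ' + t(dfdict['body'])
    except Exception:
        return None
    return {'url': dfdict['url'], 'text': text}

row = {'url': 'https://github.com/org/repo/issues/1',
       'title': ' Broken Build ', 'body': 'CI fails on master.'}
print(toy_process_dict(row)['text'])
# xxxfldtitle broken build xxxfldbody ci fails on master.
```

A row that raises during transformation (here, one missing its fields) yields None and is filtered out before the chunk is saved, which is how download_data tolerates malformed issues.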

Note: The procedure below took over 30 hours on an AWS p3.8xlarge instance with 32 cores and 64GB of memory. You may have to adjust the number of workers to fit your memory and compute constraints.


In [5]:
dfs = parallel(download_data, list(range(100)), max_workers=31)

In [13]:
dfs_rows = sum([x.shape[0] for x in dfs])
print(f'number of rows in pre-processed data: {dfs_rows:,}')


number of rows in pre-processed data: 18,620,833

In [14]:
del dfs

Cached pre-processed data

Since ~19M GitHub issues take a long time to pre-process, the pre-processed files are available here:

https://storage.googleapis.com/issue_label_bot/pre_processed_data/1_processed_csv/processed_part00{00-99}.csv

Partition Data Into Train/Validation Set

Set aside 10 randomly chosen files (out of 100) as the validation set


In [15]:
from pathlib import Path
from random import shuffle

# shuffle the files
p = Path('/ds/IssuesLanguageModel/data/1_processed_csv/')
files = p.ls()
shuffle(files)

# show a preview of files
files[:5]


Out[15]:
[PosixPath('/ds/IssuesLanguageModel/data/1_processed_csv/processed_part0039.csv'),
 PosixPath('/ds/IssuesLanguageModel/data/1_processed_csv/processed_part0095.csv'),
 PosixPath('/ds/IssuesLanguageModel/data/1_processed_csv/processed_part0097.csv'),
 PosixPath('/ds/IssuesLanguageModel/data/1_processed_csv/processed_part0029.csv'),
 PosixPath('/ds/IssuesLanguageModel/data/1_processed_csv/processed_part0064.csv')]

In [16]:
valid_df = pd.concat([pd.read_csv(f) for f in files[:10]]).dropna().drop_duplicates()
train_df = pd.concat([pd.read_csv(f) for f in files[10:]]).dropna().drop_duplicates()

print(f'rows in train_df:, {train_df.shape[0]:,}')
print(f'rows in valid_df:, {valid_df.shape[0]:,}')


rows in train_df:, 16,762,799
rows in valid_df:, 1,858,033

In [17]:
valid_df.to_pickle('/ds/IssuesLanguageModel/data/2_partitioned_df/valid_df.pkl')
train_df.to_pickle('/ds/IssuesLanguageModel/data/2_partitioned_df/train_df.pkl')

Location of Train/Validation DataFrames

You can download the above saved dataframes (in pickle format) from Google Cloud Storage:

train_df.pkl (9GB):

https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/train_df.pkl

valid_df.pkl (1GB):

https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/valid_df.pkl


In [ ]: