Sometimes you want to read in data that contains different kinds of information per line. This could be a log file where some lines contain a timestamp and the log message itself, while other lines contain specific information about errors or stack traces.
The first thing we need to accomplish is to read the file into Pandas. There is a nice little trick that you can use to read almost every kind of structured data into a Pandas DataFrame: we just have to use a separator that reads the content of a file line by line. We then mark each line by its meaning, using the characteristic text at the beginning of each line. Once we have the data in a DataFrame, we can use Pandas' functionality to reshape the data as we need it.
For this example, I chose a Git log output from the Spring PetClinic project. Such a dataset looks like the following:
commit 75912a06c5613a2ea1305ad4d8ad6bc4be7765ce
Author: Stephane Nicoll <snicoll@pivotal.io>
Date: Fri Feb 17 12:30:57 2017 +0100

    Polish contribution

    Closes gh-229

commit 443d35eae23c874ed38305fbe75216339c41beaf
Author: Henri Tremblay <henri.tremblay@gmail.com>
Date: Thu Feb 16 15:08:30 2017 -0500

    Put Ehcache back
This is a pretty challenging dataset: Although we have some lines that could be used as markers (like "commit ", "Author: " and "Date: "), we also have multiple lines without any marker text. Luckily, there is some convention here: The first line of a commit message in this particular dataset is the so-called "subject" or "title line", the other lines are the "full commit message". This makes it possible to differentiate between these two kinds of information as well. So the data structure for one complete Git log entry looks like this (from the Git documentation):
commit <sha1>
Author: <author>
Date: <author date>
[empty]
<title line>
[empty]
<full commit message>
[empty]
It may seem impossible to read that data into a Pandas DataFrame, but we'll see that it's relatively easy to do once we know some basic techniques for treating this kind of semi-structured data.
Let's start by reading in the given dataset. Take a special look at the chosen sep parameter: Here, we use the newline character \n as separator. This puts every line of our dataset file into a separate row of the DataFrame.
The names parameter takes an array of column names that provides the header for the DataFrame. This is necessary for our specific dataset because it doesn't have any header at the top of the file.
We also make use of the skipinitialspace parameter. By setting skipinitialspace=False, we keep the whitespace in front of the first characters of each line. In our case, this is very useful because the commit's subject and message lines are prefixed with four whitespace characters. Keeping that prefix ensures that we don't wrongly interpret these lines as some other type of data like the commit or author line.
There is one keyword that could be useful for this task, too, but isn't used here: With skip_blank_lines=False, the parser keeps the empty lines of the dataset instead of skipping them. Again, the use of this parameter depends highly on your kind of data.
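To see what skip_blank_lines does, here is a small sketch on a made-up two-line snippet (the content is hypothetical, just for illustration):

```python
from io import StringIO

import pandas as pd

# a tiny made-up dataset with an empty line in the middle
text = "commit abc123\n\n    Polish contribution\n"

# default behavior: the blank line is skipped entirely
df_skipped = pd.read_csv(StringIO(text), names=['raw'],
                         skipinitialspace=False)

# skip_blank_lines=False keeps the blank line as a row containing NaN
df_kept = pd.read_csv(StringIO(text), names=['raw'],
                      skipinitialspace=False, skip_blank_lines=False)
```

Note that skipinitialspace=False also preserves the four leading spaces of the message line in both DataFrames.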
In [14]:
import pandas as pd
git_log = pd.read_csv(
'../notebooks/datasets/git_log_sample.txt',
sep='\n',
names=['raw'],
skipinitialspace=False)
git_log.head()
Out[14]:
Next, we need a marking that shows which entries belong together (speaking in Data Science terms: which rows are just variables of one observation). For this, we need an entry that marks the beginning of such a group of entries. We can use the entries that start with commit together with the loc method for this purpose. First, we turn the existing index into a plain column with reset_index().
In [15]:
git_log_with_marker = git_log.reset_index()
git_log_with_marker.head()
Out[15]:
Next, we choose a suitable text in the dataset that marks the beginning of a new group of consecutive data. In our case, these are the entries that start with the text "commit ". We set the index values of all other entries to None. With this, only the rows that start a new commit keep their index value.
In [16]:
git_log_with_marker.loc[
~git_log_with_marker['raw'].str.startswith('commit '),
'index'] = None
git_log_with_marker.head()
Out[16]:
If we propagate the data of the former index column with ffill() (which forward fills missing data), we get a nice column that marks the entries that should belong together.
In [26]:
git_log_with_marker['index'] = git_log_with_marker['index'].ffill()
git_log_with_marker.head(10)
Out[26]:
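With that column in place, we could already collect the lines of each commit, for example with groupby. A quick sketch on a made-up miniature of the marked data (the values are hypothetical):

```python
import pandas as pd

# hypothetical miniature of git_log_with_marker after the ffill() step
marked = pd.DataFrame({
    'index': [0.0, 0.0, 0.0, 3.0, 3.0],
    'raw': ['commit abc', 'Author: A <a@example.com>',
            '    Polish contribution',
            'commit def', 'Author: B <b@example.com>']})

# all rows that share the same propagated marker form one commit
lines_per_commit = marked.groupby('index')['raw'].apply(list)
```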
The next steps are shown in the "Dissecting data" section.
In [18]:
git_log.loc[
git_log.raw.str.startswith("commit"),
'commit_id'] = git_log['raw'].str.replace("commit ", "")
git_log.head()
Out[18]:
Next, we fill the commit_id column of all the other rows with the information about the commit they belong to.
In [19]:
git_log['commit_id'] = git_log['commit_id'].ffill()
git_log.head()
Out[19]:
We mark each entry by its meaning. We can achieve this by using the starting characters of each raw entry. For the commit messages, we don't have any information that could be used as a marker. So in the last step, we just fill in the missing information in the type entries with a "message" text.
Side note: When to use ' and when to use "? When working directly with Pandas, the difference doesn't really matter. I use ' when I'm referencing keys or parameters, but " when I'm using text information. I'm trying to use them consistently, though.
In [20]:
git_log.loc[git_log.raw.str.startswith("commit "), 'type'] = "commit"
git_log.loc[git_log.raw.str.startswith("Author: "), 'type'] = "author"
git_log.loc[git_log.raw.str.startswith("Date: "), 'type'] = "date"
git_log['type'] = git_log['type'].fillna('message')
git_log.head()
Out[20]:
With all lines marked by their types, we can filter out the rows that only contain the information about a commit, because we copied that information to all the other rows in the previous step.
In [21]:
git_log = git_log[git_log['type'] != 'commit']
git_log.head()
Out[21]:
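Before applying the pivot to the real data, here is what pivot_table does on a made-up miniature of our long-format data (the values are hypothetical): each type value becomes a column, and the aggfunc joins multiple message lines of the same commit into one string.

```python
import pandas as pd

# hypothetical long-format data: one row per line, marked by commit and type
long_df = pd.DataFrame({
    'commit_id': ['abc', 'abc', 'abc', 'def'],
    'type': ['author', 'message', 'message', 'author'],
    'raw': ['Author: A <a@example.com>', 'Polish contribution',
            'Closes gh-229', 'Author: B <b@example.com>']})

# reshape into one row per commit; join multiple lines of the same type
wide = long_df.pivot_table(
    index='commit_id',
    columns='type',
    values='raw',
    aggfunc=lambda x: " ".join(x))
```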
In [22]:
git_log_data = git_log.pivot_table(
index='commit_id',
columns='type',
values='raw',
aggfunc=lambda x: " ".join(x))
git_log_data.head()
Out[22]:
In [23]:
git_log_data[['author', 'email']] = git_log_data['author'].str.extract(
"Author: (.*) <(.*)>", expand=True)
git_log_data.head()
Out[23]:
In [24]:
git_log_data['date'] = git_log_data['date'].str.replace("Date: ", "")
git_log_data['date'].head()
Out[24]:
In [25]:
git_log_data['date'] = pd.to_datetime(git_log_data['date'])
git_log_data.head()
Out[25]: