Sometimes you want to read in data that contains different kinds of information per line. This could be a log file where some lines contain a timestamp and the log message itself, while other lines contain specific information about errors or stack traces.
The first thing we need to accomplish is to read the file into Pandas. There is a nice little trick that you can use to read almost every kind of structured data into a Pandas DataFrame: we just have to use a separator that reads the content of a file line by line. We then mark each line by its meaning, using the characteristic text at the beginning of each line. Once we have the data in a DataFrame, we can use Pandas' functionality to reshape the data as we need it.
For this example, I chose a Git log output from the Spring PetClinic project. Such a dataset looks like the following:
commit 75912a06c5613a2ea1305ad4d8ad6bc4be7765ce
Author: Stephane Nicoll <snicoll@pivotal.io>
Date: Fri Feb 17 12:30:57 2017 +0100

    Polish contribution

    Closes gh-229

commit 443d35eae23c874ed38305fbe75216339c41beaf
Author: Henri Tremblay <henri.tremblay@gmail.com>
Date: Thu Feb 16 15:08:30 2017 -0500

    Put Ehcache back
This is a pretty challenging dataset: Although we have some lines that could be used as markers (like "commit ", "Author: " and "Date: "), we also have multiple lines without any marker text. Luckily, there is some convention here: The first line of a commit message in this particular dataset is the so-called "subject" or "title line", the other lines are the "full commit message". This makes it possible to differentiate between these two kinds of information as well. So the data structure for one complete Git log entry looks like this (from the Git documentation):
commit <sha1>
Author: <author>
Date: <author date>
[empty]
<title line>
[empty]
<full commit message>
[empty]
It may seem impossible to read that data into a Pandas DataFrame, but we'll see that it's relatively easy to do once we know some basic techniques for treating this kind of semi-structured data.
Let's start by reading in the given dataset. Take a special look at the chosen sep parameter: Here, we use the newline character \n as separator. This puts every line of our dataset file into a separate row of the DataFrame.
The names parameter takes an array of column names that provides the header for the DataFrame. This is necessary for our specific dataset because it doesn't have any header at the top of the file.
We also make use of the skipinitialspace parameter. By setting skipinitialspace=False, we keep the whitespace in front of the first characters of each line. In our case, this is very useful because the commit's subject and message lines are prefixed with four whitespace characters. Keeping that prefix ensures that we don't wrongly interpret these lines as some other type of data like the commit or author line.
There is one keyword that could be useful for this task, too, but isn't used here: With skip_blank_lines=False, the parser keeps the empty lines of the dataset instead of skipping them. Again, the use of this parameter depends highly on your kind of data.
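To see what skip_blank_lines does, here is a small sketch on a made-up two-line snippet (the content is hypothetical, just for illustration):

```python
from io import StringIO

import pandas as pd

# a tiny made-up dataset with an empty line in the middle
text = "commit abc123\n\n    Polish contribution\n"

# default behavior: the blank line is skipped entirely
df_skipped = pd.read_csv(StringIO(text), names=['raw'],
                         skipinitialspace=False)

# skip_blank_lines=False keeps the blank line as a row containing NaN
df_kept = pd.read_csv(StringIO(text), names=['raw'],
                      skipinitialspace=False, skip_blank_lines=False)
```

Note that skipinitialspace=False also preserves the four leading spaces of the message line in both DataFrames.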
In [14]:
import pandas as pd
git_log = pd.read_csv(
'../notebooks/datasets/git_log_sample.txt',
sep='\n',
names=['raw'],
skipinitialspace=False)
git_log.head()
Out[14]:
Next, we need a marking that shows which entries belong together (speaking in Data Science terms: which rows are just variables of one observation). For this, we need an entry that marks the beginning of such a group of entries. We can use the entries that start with commit together with the loc method for this purpose. First, we turn the existing index into a plain column with reset_index().
In [15]:
git_log_with_marker = git_log.reset_index()
git_log_with_marker.head()
Out[15]:
Next, we choose a suitable text in the dataset that marks the beginning of a new group of consecutive data. In our case, these are the entries that start with the text "commit ". We set the index values of all other entries to None. With this, only the rows that start a new commit keep their index value.
In [16]:
git_log_with_marker.loc[
~git_log_with_marker['raw'].str.startswith('commit '),
'index'] = None
git_log_with_marker.head()
Out[16]:
If we propagate the data of the former index column with ffill() (which forward fills missing data), we get a nice column that marks the entries that should belong together.
In [26]:
git_log_with_marker['index'] = git_log_with_marker['index'].ffill()
git_log_with_marker.head(10)
Out[26]:
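With that column in place, we could already collect the lines of each commit, for example with groupby. A quick sketch on a made-up miniature of the marked data (the values are hypothetical):

```python
import pandas as pd

# hypothetical miniature of git_log_with_marker after the ffill() step
marked = pd.DataFrame({
    'index': [0.0, 0.0, 0.0, 3.0, 3.0],
    'raw': ['commit abc', 'Author: A <a@example.com>',
            '    Polish contribution',
            'commit def', 'Author: B <b@example.com>']})

# all rows that share the same propagated marker form one commit
lines_per_commit = marked.groupby('index')['raw'].apply(list)
```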
The next steps are shown in the "Dissecting data" section.
In [18]:
git_log.loc[
git_log.raw.str.startswith("commit"),
'commit_id'] = git_log['raw'].str.replace("commit ", "")
git_log.head()
Out[18]:
Next, we fill the commit_id column of all the other rows with the information about the commit they belong to.
In [19]:
git_log['commit_id'] = git_log['commit_id'].ffill()
git_log.head()
Out[19]:
We mark each entry by its meaning. We can achieve this by using the starting characters of each raw entry. For the commit messages, we don't have any information that could be used as a marker. So in the last step, we just fill in the missing information in the type entries with a "message" text.
Side note: When to use ' and when to use "? When working directly with Pandas, the difference doesn't really matter. I use ' when I'm referencing keys or parameters, but " when I'm using text information. I'm trying to use them consistently, though.
In [20]:
git_log.loc[git_log.raw.str.startswith("commit "), 'type'] = "commit"
git_log.loc[git_log.raw.str.startswith("Author: "), 'type'] = "author"
git_log.loc[git_log.raw.str.startswith("Date: "), 'type'] = "date"
git_log['type'] = git_log['type'].fillna('message')
git_log.head()
Out[20]:
With all lines marked by their types, we can filter out the rows that only contain the information about a commit, because we copied that information to all the other rows in the previous step.
In [21]:
git_log = git_log[git_log['type'] != 'commit']
git_log.head()
Out[21]:
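Before applying the pivot to the real data, here is what pivot_table does on a made-up miniature of our long-format data (the values are hypothetical): each type value becomes a column, and the aggfunc joins multiple message lines of the same commit into one string.

```python
import pandas as pd

# hypothetical long-format data: one row per line, marked by commit and type
long_df = pd.DataFrame({
    'commit_id': ['abc', 'abc', 'abc', 'def'],
    'type': ['author', 'message', 'message', 'author'],
    'raw': ['Author: A <a@example.com>', 'Polish contribution',
            'Closes gh-229', 'Author: B <b@example.com>']})

# reshape into one row per commit; join multiple lines of the same type
wide = long_df.pivot_table(
    index='commit_id',
    columns='type',
    values='raw',
    aggfunc=lambda x: " ".join(x))
```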
In [22]:
git_log_data = git_log.pivot_table(
index='commit_id',
columns='type',
values='raw',
aggfunc=lambda x: " ".join(x))
git_log_data.head()
Out[22]:
In [23]:
git_log_data[['author', 'email']] = git_log_data['author'].str.extract(
"Author: (.*) <(.*)>", expand=True)
git_log_data.head()
Out[23]:
In [24]:
git_log_data['date'] = git_log_data['date'].str.replace("Date: ", "")
git_log_data['date'].head()
Out[24]:
In [25]:
git_log_data['date'] = pd.to_datetime(git_log_data['date'])
git_log_data.head()
Out[25]: