Introduction

In this short tutorial, I show how to read software data in various formats with Python and pandas. We use the read_csv and read_excel functions to accomplish these tasks.


In [ ]:
# Reading CSV

Reading files with mixed separators

In this section we read a less structured data set: Git log output in the following format.

<timestamp><whitespace><timezone><tab><author>

It contains two different separators: a space and a tab. Here is the content of the file datasets/mixed_separators.txt:

1514531161 -0800    Linus Torvalds
1514489303 -0500    David S. Miller
1514487644 -0800    Tom Herbert
1514487643 -0800    Tom Herbert
1514482693 -0500    Willem de Bruijn

We can read this kind of data by passing a regular expression as the separator, which requires the Python parsing engine:


In [54]:
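import pandas as pd

# A separator longer than one character is treated as a regular expression,
# which only the Python parsing engine supports.
pd.read_csv(
    "datasets/mixed_separators.txt",
    sep=r"^([0-9]*?) (.*?)\t(.*?)$",
    engine="python",
    names=["timestamp", "timezone", "author"],
    header=None)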


Out[54]:
timestamp timezone author
NaN 1514531161 -800 Linus Torvalds NaN
1514489303 -500 David S. Miller NaN
1514487644 -800 Tom Herbert NaN
1514487643 -800 Tom Herbert NaN
1514482693 -500 Willem de Bruijn NaN
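
A separator regex that matches the whole line with capture groups makes re.split return an empty string before and after the match, so each row can end up with more fields than names. That appears to be why NaN values show up in the output above. As an alternative, here is a minimal sketch that reads the file in two steps instead: split on the tab first, then split the remaining field on its single space. It assumes the timezone and author really are separated by a tab, and it uses an in-memory copy of the sample rows so it runs without the file. The names raw, raw_log and log are only illustrative.

```python
import io

import pandas as pd

# Sample rows from the listing above; "\t" separates timezone and author
# (assumption: the real file uses a tab at that position).
raw = (
    "1514531161 -0800\tLinus Torvalds\n"
    "1514489303 -0500\tDavid S. Miller\n"
    "1514487644 -0800\tTom Herbert\n"
    "1514487643 -0800\tTom Herbert\n"
    "1514482693 -0500\tWillem de Bruijn\n"
)

# Step 1: split on the tab only, which does not occur inside author names.
# dtype=str keeps every value as text so nothing is converted prematurely.
raw_log = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["when", "author"],
    dtype=str,
)

# Step 2: split "<timestamp> <timezone>" at its single space.
parts = raw_log["when"].str.split(" ", n=1, expand=True)

log = pd.DataFrame({
    "timestamp": parts[0].astype("int64"),  # seconds since the Unix epoch
    "timezone": parts[1],                   # kept as text, e.g. "-0800"
    "author": raw_log["author"],
})
print(log)
```

To read the real file, io.StringIO(raw) would be replaced by the file path. Keeping the timezone as text preserves the leading zero in values such as -0800, which is lost when the column is parsed as an integer.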