Introduction

In this short tutorial, I want to show how you can read software data in various formats with Python and pandas. We use the read_csv and read_excel functions to accomplish these tasks.



# Reading CSV

Reading files with mixed separators

In this section we read a less structured data set: Git log output in the following format.

<timestamp><whitespace><timezone><tabulator><author>

It contains two different separators: a space and a tab character. Here is the content of the file datasets/mixed_separators.txt:

1514531161 -0800    Linus Torvalds
1514489303 -0500    David S. Miller
1514487644 -0800    Tom Herbert
1514487643 -0800    Tom Herbert
1514482693 -0500    Willem de Bruijn

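We can read this kind of data with read_csv by passing a regular expression as the separator, which requires the Python parsing engine: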



In [54]:

    
import pandas as pd

# A regular-expression separator is only supported by the Python engine.
pd.read_csv(
    "datasets/mixed_separators.txt",
    sep="^([0-9]*?) (.*?)\t(.*?)$",
    engine='python',
    names=['timestamp', 'timezone', 'author'],
    header=None)









    Out[54]:

                timestamp          timezone  author
NaN 1514531161       -800    Linus Torvalds     NaN
    1514489303       -500   David S. Miller     NaN
    1514487644       -800       Tom Herbert     NaN
    1514487643       -800       Tom Herbert     NaN
    1514482693       -500  Willem de Bruijn     NaN
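
In the output above, the values sit one column to the left of their names, the author column contains only NaN, and the timezone has lost its leading zero. A likely explanation is that the Python engine splits each line on the separator pattern, and a split pattern with capture groups also returns the captured text along with empty leading and trailing fields, so each line yields five fields for three names. As a minimal sketch of one alternative (not necessarily the approach this tutorial intends), the lines can be read as plain strings and parsed with Series.str.extract, where named groups become column names. The inline `raw` string stands in for the file and is an assumption, as are the group names, which simply mirror the column names used above.

```python
import pandas as pd

# Sample rows copied from the listing above (assumption: inline data instead of
# the file). "\t" separates the timezone from the author.
raw = (
    "1514531161 -0800\tLinus Torvalds\n"
    "1514489303 -0500\tDavid S. Miller\n"
    "1514487644 -0800\tTom Herbert\n"
    "1514487643 -0800\tTom Herbert\n"
    "1514482693 -0500\tWillem de Bruijn\n"
)

# One string per log line.
lines = pd.Series(raw.splitlines())

# Named groups become the column names of the resulting DataFrame.
pattern = r"^(?P<timestamp>\d+) (?P<timezone>[+-]\d{4})\t(?P<author>.*)$"
log = lines.str.extract(pattern)

# The extracted values are strings; convert the Unix timestamp explicitly.
# The timezone stays a string, so "-0800" keeps its leading zero.
log["timestamp"] = pd.to_datetime(log["timestamp"].astype("int64"), unit="s")

print(log)
```

For a real file, `lines` could be built from the file's lines instead, for example with `pd.Series(open(path).read().splitlines())`, where `path` is whatever location the data lives at.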