Every time I encounter a new data file, there are a few initial "looks" I take at it. These help me understand whether I can load the whole set into memory and what fields it contains. Since I'm command-line oriented, I use Linux command line utilities for this (they're easily accessible from Jupyter with !), but it's easily done in Python as well.
As an example, we'll use a subset of the NYC taxi dataset. The file is called taxi.csv.
In [1]:
# Command line
!ls -lh taxi.csv
In [2]:
# Python
from os import path
print('%.2f KB' % (path.getsize('taxi.csv')/(1<<10)))
print('%.2f MB' % (path.getsize('taxi.csv')/(1<<20)))
In [3]:
# Command line
!wc -l taxi.csv
In [4]:
# Python
with open('taxi.csv') as fp:
    print(sum(1 for _ in fp))
In [5]:
# Command line
!head -1 taxi.csv | tr , \\n
!printf "%d fields" $(head -1 taxi.csv | tr , \\n | wc -l)
In [6]:
# Python
import csv
with open('taxi.csv') as fp:
    fields = next(csv.reader(fp))
print('\n'.join(fields))
print('%d fields' % len(fields))
In [7]:
# Command line
!head -2 taxi.csv | tail -1 | tr , \\n
!printf "%d values" $(head -2 taxi.csv | tail -1 | tr , \\n | wc -l)
In [8]:
# Python
with open('taxi.csv') as fp:
    fp.readline()  # Skip header
    values = next(csv.reader(fp))
print('\n'.join(values))
print('%d values' % len(values))
In [9]:
# Python (with field names)
from itertools import zip_longest
with open('taxi.csv') as fp:
    reader = csv.reader(fp)
    header = next(reader)
    values = next(reader)

for col, val in zip_longest(header, values, fillvalue='???'):
    print('%-20s: %s' % (col, val))
With both methods (with or without field names) we see that there are some extra empty fields at the end of each data row.
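To confirm that the empty trailing fields appear on every row (and not just the one we sampled), we can count them per row. Here's a minimal sketch using an inline sample that mimics the shape of the data; with the real file you'd pass `open('taxi.csv')` to `csv.reader` instead. The helper name `count_trailing_empty` is my own, not from the original.

```python
import csv
import io

# Hypothetical sample mimicking rows with trailing empty fields
sample = "a,b,c,,\n1,2,3,,\n"

def count_trailing_empty(row):
    """Count empty strings at the end of a parsed CSV row."""
    n = 0
    for value in reversed(row):
        if value:
            break
        n += 1
    return n

rows = list(csv.reader(io.StringIO(sample)))
print([count_trailing_empty(row) for row in rows])  # two trailing empties per row
```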
In [10]:
import pandas as pd
import numpy as np
date_cols = ['lpep_pickup_datetime', 'Lpep_dropoff_datetime']
with open('taxi.csv') as fp:
    header = next(csv.reader(fp))
    df = pd.read_csv(fp, names=header, usecols=np.arange(len(header)), parse_dates=date_cols)
df.head()
Out[10]:
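Once a frame is loaded, pandas can also report its in-memory footprint, which answers the "can I load the whole set?" question directly. A sketch on a small synthetic frame with made-up columns (with the real data you'd call `memory_usage` on `df` itself):

```python
import pandas as pd

# Hypothetical stand-in for the taxi DataFrame
sample_df = pd.DataFrame({'trip_distance': [1.2, 3.4, 5.6],
                          'passenger_count': [1, 2, 1]})

# deep=True also accounts for the actual contents of object (string) columns
mem_bytes = sample_df.memory_usage(deep=True).sum()
print('%.2f KB' % (mem_bytes / (1 << 10)))
```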