Every time I encounter a new data file, there are a few initial "looks" I take at it. These help me understand whether I can load the whole set into memory and what fields it contains. Since I'm command-line oriented, I use Linux command line utilities for this (they're easily accessible from Jupyter with !), but it's easily done in Python as well.
As an example, we'll use a subset of the NYC taxi dataset. The file is called taxi.csv.
In [1]:
# Command line
!ls -lh taxi.csv
In [2]:
# Python
from os import path
print('%.2f KB' % (path.getsize('taxi.csv')/(1<<10)))
print('%.2f MB' % (path.getsize('taxi.csv')/(1<<20)))
In [3]:
# Command line
!wc -l taxi.csv
In [4]:
# Python
with open('taxi.csv') as fp:
    print(sum(1 for _ in fp))
In [5]:
# Command line
!head -1 taxi.csv | tr , \\n
!printf "%d fields" $(head -1 taxi.csv | tr , \\n | wc -l)
In [6]:
# Python
import csv
with open('taxi.csv') as fp:
    fields = next(csv.reader(fp))
print('\n'.join(fields))
print('%d fields' % len(fields))
In [7]:
# Command line
!head -2 taxi.csv | tail -1 | tr , \\n
!printf "%d values" $(head -2 taxi.csv | tail -1 | tr , \\n | wc -l)
In [8]:
# Python
with open('taxi.csv') as fp:
    fp.readline()  # Skip header
    values = next(csv.reader(fp))
print('\n'.join(values))
print('%d values' % len(values))
In [9]:
# Python (with field names)
from itertools import zip_longest
with open('taxi.csv') as fp:
    reader = csv.reader(fp)
    header = next(reader)
    values = next(reader)

for col, val in zip_longest(header, values, fillvalue='???'):
    print('%-20s: %s' % (col, val))
With both methods (with or without field names) we see that there are some extra empty fields at the end of each data row.
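To confirm that the empty trailing fields appear on every row (and not just the one we sampled), we can count them per row. Here's a minimal sketch using an inline sample that mimics the shape of the data; with the real file you'd pass `open('taxi.csv')` to `csv.reader` instead. The helper name `count_trailing_empty` is my own, not from the original.

```python
import csv
import io

# Hypothetical sample mimicking rows with trailing empty fields
sample = "a,b,c,,\n1,2,3,,\n"

def count_trailing_empty(row):
    """Count empty strings at the end of a parsed CSV row."""
    n = 0
    for value in reversed(row):
        if value:
            break
        n += 1
    return n

rows = list(csv.reader(io.StringIO(sample)))
print([count_trailing_empty(row) for row in rows])  # two trailing empties per row
```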
In [10]:
import pandas as pd
import numpy as np
date_cols = ['lpep_pickup_datetime', 'Lpep_dropoff_datetime']
with open('taxi.csv') as fp:
    header = next(csv.reader(fp))
    df = pd.read_csv(fp, names=header, usecols=np.arange(len(header)), parse_dates=date_cols)
df.head()
Out[10]:
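Once a frame is loaded, pandas can also report its in-memory footprint, which answers the "can I load the whole set?" question directly. A sketch on a small synthetic frame with made-up columns (with the real data you'd call `memory_usage` on `df` itself):

```python
import pandas as pd

# Hypothetical stand-in for the taxi DataFrame
sample_df = pd.DataFrame({'trip_distance': [1.2, 3.4, 5.6],
                          'passenger_count': [1, 2, 1]})

# deep=True also accounts for the actual contents of object (string) columns
mem_bytes = sample_df.memory_usage(deep=True).sum()
print('%.2f KB' % (mem_bytes / (1 << 10)))
```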