Data storage

This notebook will take you through some different methods of loading, interpreting and saving data.



In [1]:

    
%load_ext logit
user_name = 'alsta'

Load data

Manual csv reading

Open the file 'gps.csv' and print out the first 5 lines



In [2]:

    
%%logit readlines, gps.csv
with open('gps.csv') as f:
    print "".join(f.readlines()[:5])









    



Time,Latitude N,Longitude E,orientation
s,deg,deg,deg
20131114 06:05:19,42.6040087,13.3692362,327.7
20131114 06:05:48,42.6041601,13.3694624,325.4
20131114 06:06:19,42.6043249,13.3697175,323.7

readlines :
14-05-18 20:37:05
File size: 6.32970333099 MB
Creation time: 0.016001 sec

Make a function that loads the file and returns the contents as a list of lists like this:

[['Time', 'Latitude N', 'Longitude E', 'UTM zone32U Easting', 'UTM zone 32U northing', 'Unknown', 'Platform orientation'],
['s', 'deg', 'deg', 'm', 'm', '-', 'deg'], 
['20130925 00:00:17', '54.9490087', '11.0242362', '629642.4', '6090992.3', '277034.3', '327.7'], 
['20130925 00:00:46', '54.9491601', '11.0244624', '629656.4', '6091009.5', '277046.3', '325.4'], 
['20130925 00:01:17', '54.9493249', '11.0247175', '629672.2', '6091028.4', '277060.0', '323.7']]



In [7]:

    
def read_csv(filename):
    with open(filename) as f:
        return f.readline().split(","), f.readline().split(','), [l.strip().split(',') for l in f.readlines()]
  
names, units, data = read_csv('gps.csv')
print "Names:", names
print "Units:", units

for r in data[:3]:
    print r
assert data[4][3]=='327.5', "%s!=%s"%(data[4][3], '327.5')









    



Names: ['Time', 'Latitude N', 'Longitude E', 'orientation\r\n']
Units: ['s', 'deg', 'deg', 'deg\r\n']
['20131114 06:05:19', '42.6040087', '13.3692362', '327.7']
['20131114 06:05:48', '42.6041601', '13.3694624', '325.4']
['20131114 06:06:19', '42.6043249', '13.3697175', '323.7']



In [8]:

    
%%logit read_csv, gps.csv  
tmp = read_csv('gps.csv')









    



read_csv :
14-05-18 20:47:17
File size: 6.32970333099 MB
Creation time: 0.192012 sec

CSV module

Do the same using the csv-module



In [25]:

    
import csv
def read_csv_module(filename):
    with open(filename, 'rb') as f:
        data = list(csv.reader(f, delimiter=',', quotechar='\''))
        return data[0],data[1],data[2:]

names, units, data = read_csv_module('gps.csv')
print "Names:", names
print "Units:", units

for r in data[:3]:
    print r
assert data[4][3]=='327.5', "%s!=%s"%(data[4][3], '327.5')









    



Names: ['Time', 'Latitude N', 'Longitude E', 'orientation']
Units: ['s', 'deg', 'deg', 'deg']
['20131114 06:05:19', '42.6040087', '13.3692362', '327.7']
['20131114 06:05:48', '42.6041601', '13.3694624', '325.4']
['20131114 06:06:19', '42.6043249', '13.3697175', '323.7']



In [26]:

    
%%logit read_csv_module, gps.csv
tmp = read_csv_module('gps.csv')









    



read_csv_module :
14-05-18 21:05:47
File size: 6.32970333099 MB
Creation time: 0.172011 sec

Interpreting data

Interpret dates using the datetime-module

Constructor: datetime.datetime(year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

Parse string: datetime.strptime(date_string, format)

Format datetime: datetime.strftime(format)

Format specification: http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

Create the right format and test it on the date in row 1000:



In [22]:

    
from datetime import datetime
#fmt = "%Y%m%d %H:%M:%S"
fmt = "%Y%m%d %X"

d = datetime.strptime(data[1000][0], fmt)
print d
assert d==datetime(2013,11,14,14,25,19)









    



2013-11-14 14:25:19

Make a list of datetime-objects from the date/time-strings in the first column



In [27]:

    
dates = [datetime.strptime(line[0], fmt) for line in data]
print dates[:3]









    



[datetime.datetime(2013, 11, 14, 6, 5, 19), datetime.datetime(2013, 11, 14, 6, 5, 48), datetime.datetime(2013, 11, 14, 6, 6, 19)]

Make a list with the weekday name of the dates



In [31]:

    
weekdays = [datetime.strftime(line,"%A") for line in dates]
print set(weekdays)









    



set(['Monday', 'Tuesday', 'Friday', 'Wednesday', 'Thursday', 'Sunday', 'Saturday'])

Interpret floating point values

Parse the data into a 32bit float numpy array (i.e. exclude the date/time column)



In [32]:

    
import numpy as np
np_floats = np.array([d[1:] for d in data],dtype=np.float32)

print np_floats; 
print np_floats.shape; assert np_floats.shape == (142779, 3)
print np_floats.dtype; assert np_floats.dtype==np.float32









    



[[  42.60400772   13.36923599  327.70001221]
 [  42.60416031   13.36946201  325.3999939 ]
 [  42.60432434   13.3697176   323.70001221]
 ..., 
 [  42.48498154   13.46849155   46.29999924]
 [  42.48497772   13.46849251   46.79999924]
 [  42.48497772   13.46849251   46.79999924]]
(142779, 3)
float32

Make a dictionary with the keys:

dates, weekdays, latitude, longitude, orientation

and assign to each key the corresponding list



In [35]:

    
data_dict = {'dates':dates,
             'weekdays':weekdays,
             'latitude':np_floats[:,0],
             'longitude':np_floats[:,1],
             'orientation':np_floats[:,2]}

for k, v in data_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.604              	<type 'numpy.float32'>
dates               	2013-11-14 06:05:19 	<type 'datetime.datetime'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692             	<type 'numpy.float32'>
orientation         	327.7               	<type 'numpy.float32'>

Saving data

Before we save any data, it is necessary to consider the datatype.

Dates can be saved as

	good	bad
datetime-object	Easy to work with in python, pickles directly	file size, string representation not suitable for saving
string	Easy to read in file	file size, parsing required
64 bit float (i.e. seconds since 1/1/1970)	Compact in file	difficult to read in file, parsing required



In [36]:

    
date_objects = dates
date_strings = [ row[0] for row in data]
date_floats = [(dt - datetime.utcfromtimestamp(0)).total_seconds() for dt in date_objects]

print date_objects[0]
print date_strings[0]
print date_floats[0]









    



2013-11-14 06:05:19
20131114 06:05:19
1384409119.0

Weekdays can be saved as

String (Monday, Mon,...)
Integer (1)



In [37]:

    
weekday_strings = weekdays
weekday_integers = [int(datetime.strftime(d,"%w")) for d in dates]

print weekday_strings[0]
print weekday_integers[0]









    



Thursday
4

Before we go on, make a function that returns the size of a file in mb



In [41]:

    
import os
def fsize(filename):
    return os.path.getsize(filename)/2.**20
    
print fsize('gps.csv')
assert np.round(fsize('gps.csv'),2) == 6.33









    



6.32970333099

Save manual as CSV

Make a function save_csv that saves its data-argument (list of list) as csv with names and units as header



In [44]:

    
def save_csv(filename, names, units, data):
    with open(filename,'w') as f:
        f.write(','.join(names)+'\n\r')
        f.write(','.join(units)+'\n\r')
        [f.write(','.join(line)+'\n\r') for line in data]



In [46]:

    
%%logit save_csv, tmp.csv
save_csv('tmp.csv', names, units, data)









    



save_csv :
14-05-18 21:28:50
File size: 6.32970333099 MB
Creation time: 0.060004 sec



In [48]:

    
import os
print fsize('tmp.csv'), fsize('gps.csv')
#assert fsize('tmp.csv') == fsize('gps.csv')









    



6.32970333099 6.32970333099

CSV module

Do the same using the csv-module



In [50]:

    
import csv
def save_csv_module(filename, names, units, data):
    with open(filename, 'wb') as f:
        dat = csv.writer(f, delimiter=',')
        dat.writerow(names)
        dat.writerow(units)
        for line in data:
            dat.writerow(line)
%timeit save_csv_module('tmp.csv', names, units, data)









    



10 loops, best of 3: 174 ms per loop



In [51]:

    
%%logit save_csv_module, tmp.csv
save_csv_module('tmp.csv', names, units, data)









    



save_csv_module :
14-05-18 21:34:48
File size: 6.32970333099 MB
Creation time: 0.16801 sec



In [52]:

    
import os
print fsize('tmp.csv'), fsize('gps.csv')
assert fsize('tmp.csv') == fsize('gps.csv')









    



6.32970333099 6.32970333099

Numpy

Numpy arrays can be saved directly as csv, binary and compressed binary. Unfortunately these options do not support datetime-objects or strings.

So we need to concatenate a np_data array of date_floats, weekday_integers and np_floats



In [53]:

    
print np.atleast_2d(np.array(date_floats)).shape
print np.atleast_2d(np.array(weekday_integers)).shape
print np_floats.T.shape

np_data = np.r_[np.atleast_2d(np.array(date_floats)),np.atleast_2d(np.array(weekday_integers)),np_floats.T].T.astype(np.float64)
                
print np_data.shape
assert np_data.shape==(142779,5)
print np_data.dtype
assert np_data.dtype==np.float64









    



(1, 142779)
(1, 142779)
(3, 142779)
(142779, 5)
float64

Save the float array np_data as csv and binary and compressed binary using numpy



In [54]:

    
%%logit save_csv, tmp.csv
np.savetxt('tmp.csv', np_data)









    



save_csv :
14-05-18 21:36:44
File size: 17.0224962234 MB
Creation time: 0.732046 sec



In [57]:

    
%%logit save_bin, tmp.npy
np.save('tmp', np_data)









    



save_bin :
14-05-18 21:40:35
File size: 5.44666290283 MB
Creation time: 0.008002 sec



In [58]:

    
%%logit save_zipbin, tmp.npz
np.savez('tmp', np_data)









    



save_zipbin :
14-05-18 21:42:09
File size: 5.44677352905 MB
Creation time: 0.020001 sec

Pickle

Pickle is a way to save and load python objects (e.g. data_dict) directly, even though it contains a mix of dict, list string, float and datetime-objects, as you can see by running the next cell



In [ ]:

    
for k, v in data_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)

Save data_dict as binary using cpickle - a faster implementation of pickle

See example at https://wiki.python.org/moin/UsingPickle



In [ ]:

    
%%logit cpickle_dump, tmp.bin
import cPickle as pickle
pickle....
print fsize('tmp.bin')



In [ ]:

    
loaded_dict = pickle.load(open('tmp.bin'))
for k, v in loaded_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)

Json

Json is a human readable standard data format. It is quite similar to xml, but a little more compact (compression (like zip) will reduce the file size of both xml and json considerably).

The python json module are able to format e.g. a python dictionary as json, but only pure python types are supported, i.e. not datetime-objects or numpy arrays. Therefore a pure python dict is created



In [59]:

    
pydata_dict = {'dates': date_strings,
               'weekdays':weekdays, 
             'latitude':np_data[:,2].tolist(), 
             'longitude':np_data[:,3].tolist(), 
             'orientation':np_data[:,4].tolist()}

for k, v in pydata_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.6040077209       	<type 'float'>
dates               	20131114 06:05:19   	<type 'str'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692359924       	<type 'float'>
orientation         	327.700012207       	<type 'float'>

The following function saves dumps the data argument into a json (pretty) formated file



In [60]:

    
import json
def save_json(filename, data):
    with open(filename,'w') as fid:
        fid.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ': ')))



In [61]:

    
%%logit save_json, tmp.json
save_json('tmp.json', pydata_dict)









    



save_json :
14-05-18 21:45:30
File size: 17.4378013611 MB
Creation time: 0.912057 sec

HDF5

HDF5 is a hirachical dataformat like json and xml, but it is binary and therefore much more compact. The drawback is that a program is required to read the files, e.g. HDFView

In order to use HDF5 from python the module h5py (http://www.h5py.org/) is required (included in winpython, but not in anaconda)

pip install h5py

HDF5 stores numpy arrays as datasets, so pydata_dict can be saved like this



In [62]:

    
for k, v in pydata_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.6040077209       	<type 'float'>
dates               	20131114 06:05:19   	<type 'str'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692359924       	<type 'float'>
orientation         	327.700012207       	<type 'float'>



In [63]:

    
import h5py
def save_hdf5(filename, data):
    f = h5py.File(filename, "w")
    try:
        for k,v in data.items():
            f.create_dataset(k,data=np.array(v))
    except Exception:
        pass
    f.close()



In [64]:

    
%%logit save_hdf5, tmp.hdf5
save_hdf5('tmp.hdf5', pydata_dict)









    



save_hdf5 :
14-05-18 21:47:39
File size: 6.81223106384 MB
Creation time: 0.072005 sec

It is however possible to reduce the file size if proper data types are manually selected for all lists



In [65]:

    
npdata_dict = {'dates':np_data[:,0],
               'weekdays': np_data[:,1].astype(np.int8), 
             'latitude':np_data[:,2], 
             'longitude':np_data[:,3].astype(np.float32), 
             'orientation':np_data[:,4].astype(np.float32)}

for k, v in npdata_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.6040077209       	<type 'numpy.float64'>
dates               	1384409119.0        	<type 'numpy.float64'>
weekdays            	4                   	<type 'numpy.int8'>
longitude           	13.3692             	<type 'numpy.float32'>
orientation         	327.7               	<type 'numpy.float32'>



In [66]:

    
%%logit save_hdf5_astype, tmp.hdf5
save_hdf5('tmp.hdf5', npdata_dict)









    



save_hdf5_astype :
14-05-18 21:47:52
File size: 3.40811443329 MB
Creation time: 0.004 sec



In [ ]: