Data storage

This notebook will take you through some different methods of loading, interpreting and saving data.



In [1]:

    
%load_ext logit
user_name = 'FcoH'

Load data

Manual csv reading

Open the file 'gps.csv' and print out the first 5 lines



In [2]:

    
%%logit readlines, gps.csv
with open('gps.csv') as f:
    for i in range(5):
        print f.readline()









    



Time,Latitude N,Longitude E,orientation

s,deg,deg,deg

20131114 06:05:19,42.6040087,13.3692362,327.7

20131114 06:05:48,42.6041601,13.3694624,325.4

20131114 06:06:19,42.6043249,13.3697175,323.7

readlines :
14-03-05 08:48:15
File size: 6.32970333099 MB
Creation time: 0.000974286144381 sec

Make a function that loads the file and returns the contents as a list of lists like this:

[['Time', 'Latitude N', 'Longitude E', 'UTM zone32U Easting', 'UTM zone 32U northing', 'Unknown', 'Platform orientation'],
['s', 'deg', 'deg', 'm', 'm', '-', 'deg'], 
['20130925 00:00:17', '54.9490087', '11.0242362', '629642.4', '6090992.3', '277034.3', '327.7'], 
['20130925 00:00:46', '54.9491601', '11.0244624', '629656.4', '6091009.5', '277046.3', '325.4'], 
['20130925 00:01:17', '54.9493249', '11.0247175', '629672.2', '6091028.4', '277060.0', '323.7']]



In [3]:

    
def read_csv(filename):
    name=[]
    units=[]
    data=[]
    with open(filename) as f:
        name.append(f.readline().split(","))
        units.append(f.readline().split(","))
        for i in f.readlines():
            data.append(i.strip().split(","))
    return name, units, data

    
names, units, data = read_csv('gps.csv')
print "Names:", names
print "Units:", units

for r in data[:5]:
    print r
assert data[4][3]=='327.5', "%s!=%s"%(data[4][3], '327.5')









    



Names: [['Time', 'Latitude N', 'Longitude E', 'orientation\n']]
Units: [['s', 'deg', 'deg', 'deg\n']]
['20131114 06:05:19', '42.6040087', '13.3692362', '327.7']
['20131114 06:05:48', '42.6041601', '13.3694624', '325.4']
['20131114 06:06:19', '42.6043249', '13.3697175', '323.7']
['20131114 06:06:48', '42.6044777', '13.3699596', '325.6']
['20131114 06:07:19', '42.6046397', '13.3702266', '327.5']



In [4]:

    
%%logit read_csv, gps.csv  
tmp = read_csv('gps.csv')









    



read_csv :
14-03-05 08:48:20
File size: 6.32970333099 MB
Creation time: 0.182510062175 sec

CSV module

Do the same using the csv-module



In [5]:

    
import csv
def read_csv_module(filename):
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        return next(reader), next(reader),  list(reader)
        
names, units, data = read_csv_module('gps.csv')
print "Names:", names
print "Units:", units

for r in data[:5]:
    print r
assert data[4][3]=='327.5', "%s!=%s"%(data[4][3], '327.5')









    



Names: ['Time', 'Latitude N', 'Longitude E', 'orientation']
Units: ['s', 'deg', 'deg', 'deg']
['20131114 06:05:19', '42.6040087', '13.3692362', '327.7']
['20131114 06:05:48', '42.6041601', '13.3694624', '325.4']
['20131114 06:06:19', '42.6043249', '13.3697175', '323.7']
['20131114 06:06:48', '42.6044777', '13.3699596', '325.6']
['20131114 06:07:19', '42.6046397', '13.3702266', '327.5']



In [6]:

    
%%logit read_csv_module, gps.csv
tmp = read_csv_module('gps.csv')









    



read_csv_module :
14-03-05 08:48:27
File size: 6.32970333099 MB
Creation time: 0.206182592551 sec

Interpreting data

Interpret dates using the datetime-module

Constructor: datetime.datetime(year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

Parse string: datetime.strptime(date_string, format)

Format datetime: datetime.strftime(format)

Format specification: http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

Create the right format and test it on the date in row 1000:



In [7]:

    
from datetime import datetime
fmt = "%Y%m%d %H:%M:%S"

d = datetime.strptime(data[1000][0], fmt)
print d
assert d==datetime(2013,11,14,14,25,19)









    



2013-11-14 14:25:19

Make a list of datetime-objects from the date/time-strings in the first column



In [8]:

    
dates = [datetime.strptime(l[0], fmt) for l in data]
print dates[:3]









    



[datetime.datetime(2013, 11, 14, 6, 5, 19), datetime.datetime(2013, 11, 14, 6, 5, 48), datetime.datetime(2013, 11, 14, 6, 6, 19)]

Make a list with the weekday name of the dates



In [9]:

    
weekdays = [datetime.strftime(l,"%A") for l in dates]
print set(weekdays)









    



set(['Monday', 'Tuesday', 'Friday', 'Wednesday', 'Thursday', 'Sunday', 'Saturday'])

Interpret floating point values

Parse the data into a 32bit float numpy array (i.e. exclude the date/time column)



In [10]:

    
import numpy as np
np_floats = np.array(data)[:,1:].astype(np.float32)
print np_floats; 
print np_floats.shape; assert np_floats.shape == (142779, 3)
print np_floats.dtype; assert np_floats.dtype==np.float32









    



[[  42.60400772   13.36923599  327.70001221]
 [  42.60416031   13.36946201  325.3999939 ]
 [  42.60432434   13.3697176   323.70001221]
 ..., 
 [  42.48498154   13.46849155   46.29999924]
 [  42.48497772   13.46849251   46.79999924]
 [  42.48497772   13.46849251   46.79999924]]
(142779, 3)
float32

Make a dictionary with the keys:

dates, weekdays, latitude, longitude, orientation

and assign to each key the corresponding list



In [11]:

    
data_dict = {'dates':dates,
             'weekdays':weekdays,
             'latitude':np_floats[:,0],
             'longitude':np_floats[:,1],
             'orientation':np_floats[:,2]}

for k, v in data_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.604              	<type 'numpy.float32'>
dates               	2013-11-14 06:05:19 	<type 'datetime.datetime'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692             	<type 'numpy.float32'>
orientation         	327.7               	<type 'numpy.float32'>

Saving data

Before we save any data, it is necessary to consider the datatype.

Dates can be saved as

	good	bad
datetime-object	Easy to work with in python, pickles directly	file size, string representation not suitable for saving
string	Easy to read in file	file size, parsing required
64 bit float (i.e. seconds since 1/1/1970)	Compact in file	difficult to read in file, parsing required



In [12]:

    
date_objects = dates
date_strings = [ row[0] for row in data]
date_floats = [(dt - datetime.utcfromtimestamp(0)).total_seconds() for dt in date_objects]

print date_objects[0]
print date_strings[0]
print date_floats[0]









    



2013-11-14 06:05:19
20131114 06:05:19
1384409119.0

Weekdays can be saved as

String (Monday, Mon,...)
Integer (1)



In [13]:

    
weekday_strings = weekdays
weekday_integers = [int(datetime.strftime(d,"%w")) for d in dates]

print weekday_strings[0]
print weekday_integers[0]









    



Thursday
4

Before we go on, make a function that returns the size of a file in mb



In [1]:

    
import os
def fsize(filename):
    return os.path.getsize(filename)/1048576.0

print fsize('gps.csv')
assert np.round(fsize('gps.csv'),2) == 6.33









    



---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-344e92af0e76> in <module>()
      3     return os.path.getsize(filename)/1048576.0
      4 
----> 5 print fsize('gps.csv')
      6 assert np.round(fsize('gps.csv'),2) == 6.33

<ipython-input-1-344e92af0e76> in fsize(filename)
      1 import os
      2 def fsize(filename):
----> 3     return os.path.getsize(filename)/1048576.0
      4 
      5 print fsize('gps.csv')

/usr/lib/python2.7/genericpath.pyc in getsize(filename)
     47 def getsize(filename):
     48     """Return the size of a file, reported by os.stat()."""
---> 49     return os.stat(filename).st_size
     50 
     51 

OSError: [Errno 2] No such file or directory: 'gps.csv'

Save manual as CSV

Make a function save_csv that saves its data-argument (list of list) as csv with names and units as header



In [15]:

    
def save_csv(filename, names, units, data):
    with open(filename,'w') as f:
        f.write(",".join(names)+"\n")
        f.write(",".join(units)+"\n")
        f.writelines([",".join(l)+"\n" for l in data])



In [16]:

    
%%logit save_csv, tmp.csv
save_csv('tmp.csv', names, units, data)









    



save_csv :
14-03-05 08:48:54
File size: 6.32970333099 MB
Creation time: 0.0623296044619 sec



In [17]:

    
import os
print fsize('tmp.csv'), fsize('gps.csv')
assert fsize('tmp.csv') == fsize('gps.csv')









    



6.32970333099 6.32970333099

CSV module

Do the same using the csv-module



In [18]:

    
import csv
def save_csv_module(filename, names, units, data):
    with open(filename, 'wb') as csvfile:
        writ = csv.writer(csvfile, delimiter=',')
        writ.writerow(names)
        writ.writerow(units)
        for l in data:
            writ.writerow(l)
%timeit save_csv_module('tmp.csv', names, units, data)









    



10 loops, best of 3: 151 ms per loop



In [19]:

    
%%logit save_csv_module, tmp.csv
save_csv_module('tmp.csv', names, units, data)









    



save_csv_module :
14-03-05 08:49:10
File size: 6.32970333099 MB
Creation time: 0.164217963358 sec



In [20]:

    
import os
print fsize('tmp.csv'), fsize('gps.csv')
assert fsize('tmp.csv') == fsize('gps.csv')









    



6.32970333099 6.32970333099

Numpy

Numpy arrays can be saved directly as csv, binary and compressed binary. Unfortunately these options do not support datetime-objects or strings.

So we need to concatenate a np_data array of date_floats, weekday_integers and np_floats



In [21]:

    
print np.atleast_2d(np.array(date_floats)).shape
print np.atleast_2d(np.array(weekday_integers)).shape
print np_floats.T.shape

np_data = np.r_[np.atleast_2d(np.array(date_floats)), np.atleast_2d(np.array(weekday_integers)),np_floats.T].T.astype(np.float64)
                
                
print np_data.shape
assert np_data.shape==(142779,5)
print np_data.dtype
assert np_data.dtype==np.float64









    



(1, 142779)
(1, 142779)
(3, 142779)
(142779, 5)
float64

Save the float array np_data as csv and binary and compressed binary using numpy



In [22]:

    
%%logit save_csv, tmp.csv
np.savetxt('tmp.csv', np_data)









    



save_csv :
14-03-05 08:49:17
File size: 17.1586608887 MB
Creation time: 0.755722172974 sec



In [23]:

    
%%logit save_bin, tmp.npy
np.save('tmp', np_data)









    



save_bin :
14-03-05 08:49:20
File size: 5.44666290283 MB
Creation time: 0.00311938825627 sec



In [26]:

    
%%logit save_zipbin, tmp.npz
np.savez('tmp', np_data)









    



save_zipbin :
14-03-05 08:49:29
File size: 5.44677352905 MB
Creation time: 0.0212316831924 sec

Pickle

Pickle is a way to save and load python objects (e.g. data_dict) directly, even though it contains a mix of dict, list string, float and datetime-objects, as you can see by running the next cell



In [27]:

    
for k, v in data_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.604              	<type 'numpy.float32'>
dates               	2013-11-14 06:05:19 	<type 'datetime.datetime'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692             	<type 'numpy.float32'>
orientation         	327.7               	<type 'numpy.float32'>

Save data_dict as binary using cpickle - a faster implementation of pickle

See example at https://wiki.python.org/moin/UsingPickle



In [28]:

    
%%logit cpickle_dump, tmp.bin
import cPickle as pickle
pickle.dump(data_dict, open( "tmp.bin", "wb" ))
print fsize('tmp.bin')









    



13.5226831436
cpickle_dump :
14-03-05 08:49:32
File size: 13.5226831436 MB
Creation time: 0.710002189578 sec



In [29]:

    
loaded_dict = pickle.load(open('tmp.bin'))
for k, v in loaded_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.604              	<type 'numpy.float32'>
dates               	2013-11-14 06:05:19 	<type 'datetime.datetime'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692             	<type 'numpy.float32'>
orientation         	327.7               	<type 'numpy.float32'>

Json

Json is a human readable standard data format. It is quite similar to xml, but a little more compact (compression (like zip) will reduce the file size of both xml and json considerably).

The python json module are able to format e.g. a python dictionary as json, but only pure python types are supported, i.e. not datetime-objects or numpy arrays. Therefore a pure python dict is created



In [30]:

    
pydata_dict = {'dates': date_strings,
               'weekdays':weekdays, 
             'latitude':np_data[:,2].tolist(), 
             'longitude':np_data[:,3].tolist(), 
             'orientation':np_data[:,4].tolist()}

for k, v in pydata_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.6040077209       	<type 'float'>
dates               	20131114 06:05:19   	<type 'str'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692359924       	<type 'float'>
orientation         	327.700012207       	<type 'float'>

The following function saves dumps the data argument into a json (pretty) formated file



In [31]:

    
import json
def save_json(filename, data):
    with open(filename,'w') as fid:
        fid.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ': ')))



In [32]:

    
%%logit save_json, tmp.json
save_json('tmp.json', pydata_dict)









    



save_json :
14-03-05 08:49:42
File size: 18.1186351776 MB
Creation time: 1.11201781769 sec

HDF5

HDF5 is a hirachical dataformat like json and xml, but it is binary and therefore much more compact. The drawback is that a program is required to read the files, e.g. HDFView

In order to use HDF5 from python the module h5py (http://www.h5py.org/) is required (included in winpython, but not in anaconda)

pip install h5py

HDF5 stores numpy arrays as datasets, so pydata_dict can be saved like this



In [33]:

    
for k, v in pydata_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.6040077209       	<type 'float'>
dates               	20131114 06:05:19   	<type 'str'>
weekdays            	Thursday            	<type 'str'>
longitude           	13.3692359924       	<type 'float'>
orientation         	327.700012207       	<type 'float'>



In [34]:

    
import h5py
def save_hdf5(filename, data):
    f = h5py.File(filename, "w")
    try:
        for k,v in data.items():
            f.create_dataset(k,data=np.array(v))
    except Exception:
        pass
    f.close()



In [35]:

    
%%logit save_hdf5, tmp.hdf5
save_hdf5('tmp.hdf5', pydata_dict)









    



save_hdf5 :
14-03-05 08:49:53
File size: 6.81223106384 MB
Creation time: 0.0529813232045 sec

It is however possible to reduce the file size if proper data types are manually selected for all lists



In [36]:

    
npdata_dict = {'dates':np_data[:,0],
               'weekdays': np_data[:,1].astype(np.int8), 
             'latitude':np_data[:,2], 
             'longitude':np_data[:,3].astype(np.float32), 
             'orientation':np_data[:,4].astype(np.float32)}

for k, v in npdata_dict.items():
    print "%-20s\t%-20s\t%s" % (k, v[0], v[0].__class__)









    



latitude            	42.6040077209       	<type 'numpy.float64'>
dates               	1384409119.0        	<type 'numpy.float64'>
weekdays            	4                   	<type 'numpy.int8'>
longitude           	13.3692             	<type 'numpy.float32'>
orientation         	327.7               	<type 'numpy.float32'>



In [37]:

    
%%logit save_hdf5_astype, tmp.hdf5
save_hdf5('tmp.hdf5', npdata_dict)









    



save_hdf5_astype :
14-03-05 08:49:56
File size: 3.40811443329 MB
Creation time: 0.00593580887417 sec



In [38]: