Reading and Writing Files

by David Paredes (david.paredes@durham.ac.uk) and James Keaveney (james.keaveney@durham.ac.uk)

Two ways of opening a data file

In Python, there are two ways of opening a data file:

  • Using the built-in open function (see the sketch below)
  • Using libraries that parse the file, if it has a standard format

Usually, we will deal with highly structured files, so the second method will be easier in most cases (although for very large files this is no longer true).
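For completeness, here is a minimal sketch of the first method, reading the example file used later in this tutorial line-by-line with open:


In [ ]:
# minimal sketch: read a file line-by-line with the built-in open function
with open("./code/io/csv_example/simpleDataset.csv") as f:
    for line in f:
        print line.strip()   # strip() removes the trailing newline

Each line comes back as a plain string; splitting it up and converting the values is then up to you, which is exactly the work the libraries below do for us.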

In this tutorial, we will overview two modules/functions, numpy.loadtxt and csv, that allow you to get data from comma-separated-value (csv) files, where, as the name suggests, the data values reside in a file separated by commas, spaces or other delimiters.

loadtxt

The function numpy.loadtxt, also present as scipy.loadtxt in the scipy package, is very simple to use. You just pass the name of the file as a string (that is, enclosed in single or double quotes) and the output is an array with the contents of the file.

For example, let's load the information in the file simpleDataset.csv and store it in the variable myDataset.


In [75]:
import numpy     #Remember to import the module
myDataset = numpy.loadtxt("./code/io/csv_example/simpleDataset.csv")
print myDataset


[[  1.       3.023 ]
 [  2.       5.1   ]
 [  6.      23.    ]
 [  6.6      2.    ]
 [  8.       9.23  ]
 [  8.1     10.0001]]

The text file contains two columns of numbers separated by a single space. By default, the function treats any whitespace as a delimiter. If, for example, the values were separated by commas (as in simpleDatasetComma.txt), we would need to specify the keyword argument delimiter, which is a string containing the delimiting character(s).


In [76]:
myDatasetComma = numpy.loadtxt("./code/io/csv_example/simpleDatasetComma.txt", delimiter=',')
print myDatasetComma


[[  1.       3.023 ]
 [  2.       5.1   ]
 [  6.      23.    ]
 [  6.6      2.    ]
 [  8.       9.23  ]
 [  8.1     10.0001]]

Some files contain headers that are not part of the data (see fileWithHeader.csv). This file starts with 3 lines that give information about the data, but are not data themselves. The header can be skipped using the keyword skiprows:


In [77]:
complicatedDataset = numpy.loadtxt("./code/io/csv_example/fileWithHeader.csv",delimiter=',', skiprows=3)
print complicatedDataset


[[-0.0836      0.479172    0.00209844  0.202813  ]
 [-0.083598    0.479313    0.00194219  0.202813  ]
 [-0.083596    0.479313    0.00180156  0.20275   ]
 [-0.083594    0.479313    0.00191875  0.20277   ]
 [-0.083592    0.478969    0.00184531  0.20275   ]]

You can also select which columns to extract from the file with the keyword usecols:


In [78]:
justSomeCols = numpy.loadtxt("./code/io/csv_example/fileWithHeader.csv", 
                                   delimiter=',', skiprows=3, usecols=(1,3))
print justSomeCols


[[ 0.479172  0.202813]
 [ 0.479313  0.202813]
 [ 0.479313  0.20275 ]
 [ 0.479313  0.20277 ]
 [ 0.478969  0.20275 ]]

You can get more information about numpy.loadtxt in the scipy documentation, or by using the help function, as described in the page Basics - Help and information.
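For example, a one-line sketch:


In [ ]:
help(numpy.loadtxt)   # prints the full documentation for loadtxt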

csv

The csv module has a syntax very similar to that of loadtxt, but it is much more powerful in the sense that it can handle all sorts of data types. The official documentation states:

The lack of a standard (for csv files) means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources.

This module provides classes to read and write tabular data from/to different formats. The most basic example of reading with this module is:


In [79]:
import csv
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
    importantReader = csv.reader(ultraImportantFile, delimiter=' ')
    for row in importantReader:
        print row


['1', '3.023']
['2', '5.1']
['6', '23']
['6.6', '2']
['8', '9.23']
['8.1', '10.0001']

As you can see (look at the quotation marks), the reader returns each row as a list of strings, split on the delimiter. This is the default behaviour, since csv files often contain heterogeneous data. You can use the built-in function float to convert an entry to a floating-point number (for example), like so:


In [80]:
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
    data = list(csv.reader(ultraImportantFile, delimiter=' '))
    print float(data[0][1])


3.023
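To convert the whole dataset in one go, here is a minimal sketch using a list comprehension (not part of the original example):


In [ ]:
# sketch: convert every entry of every row to a float
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
    data = [[float(entry) for entry in row]
            for row in csv.reader(ultraImportantFile, delimiter=' ')]
print data[0]   # the first row, now as floats

This only works if every entry really is numeric; for mixed data you would convert field-by-field, as in the next example.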

It is also possible to give meaningful names to the columns. For example, if the two columns in the file represent "current" and "voltage", we can use the DictReader class:


In [81]:
currentData = []
voltData = []
cols = ['current', 'voltage']
with open("./code/io/csv_example/simpleDatasetComma.txt", 'rb') as csvfile:
    for row in csv.DictReader(csvfile, fieldnames=cols, delimiter=','):
        # Convert non-string data here e.g.:
        thiscurrent = float(row['current'])
        thisvoltage = float(row['voltage'])
        currentData.append(thiscurrent)
        voltData.append(thisvoltage)
    print currentData, voltData


[1.0, 2.0, 6.0, 6.6, 8.0, 8.1] [3.023, 5.1, 23.0, 2.0, 9.23, 10.0001]

File Output

So far we've looked at reading files in, but occasionally it's useful to be able to save processed data. If the data needs to be human-readable, or read again outside of the code that wrote it, then outputting as a csv is useful. If not, for example when a long calculation produces a result that you want to look up on the next run instead of re-calculating, then pickled (binary) data is the easiest format to use. We will give examples of both below.

CSV writing

Let's generate some data to eventually export as csv:


In [82]:
import numpy as np

# generate some example data
x = np.arange(-10,10,0.01)
y = np.sin(3*x**2)*np.cos(x)**2

# print the first few lines
print x[0:10], y[0:10]


[-10.    -9.99  -9.98  -9.97  -9.96  -9.95  -9.94  -9.93  -9.92  -9.91] [-0.70386913 -0.57965427 -0.24754818  0.17987328  0.55413639  0.74256453
  0.67521336  0.3704655  -0.06997309 -0.49529355]

Now let's write this into a two-column csv file. We write row-by-row, so first we need to pair the data up into rows. We do this using the built-in zip function, which returns a list of (x, y) tuples:


In [83]:
xy = zip(x,y)

# look at the first few lines of this - note the format is different to the previous block
print xy[0:10]


[(-10.0, -0.70386913217899505), (-9.9900000000000002, -0.57965426709639589), (-9.9800000000000004, -0.24754817534860593), (-9.9700000000000006, 0.17987328163207517), (-9.9600000000000009, 0.55413639253598546), (-9.9500000000000011, 0.74256452644996929), (-9.9400000000000013, 0.67521336030495893), (-9.9300000000000015, 0.37046549973305676), (-9.9200000000000017, -0.069973092764842343), (-9.9100000000000019, -0.49529354803092768)]

In [84]:
import csv

filename = './code/io/csv_example/csv_write_example.csv'
with open(filename, 'wb') as csvfile:
    csv_writer = csv.writer(csvfile,delimiter=',')
    
    # if header lines are required, they can be written here
    header_line = ('Time (ms)', 'Voltage (V)')
    csv_writer.writerow(header_line)
    
    # write main block of data
    for xy_line in xy:
        csv_writer.writerow(xy_line)

That's it. More columns can be added simply by zipping more things together, e.g. zip(x,y,z,...), as sketched below. If you want to look at the csv file we just generated, it's at ./code/io/csv_example/csv_write_example.csv.
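A minimal sketch of the multi-column case (z and the output filename here are made up for illustration):


In [ ]:
# sketch: write a three-column csv; z is a made-up third data array
z = x**2
with open('./code/io/csv_example/csv_write_example_3col.csv', 'wb') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',')
    for xyz_line in zip(x, y, z):
        csv_writer.writerow(xyz_line)

For purely numeric arrays like these, numpy.savetxt offers a one-call alternative.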

The Pickle module

Let's generate some data that takes a while to process. A large 2D array should do the job nicely for now. Let's plot it as well, while we're at it. And for comparison purposes, let's time how long it takes...


In [85]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import time

st = time.clock()

# make large arrays
x = np.arange(-100,100,0.1)
y = np.arange(-200,200,0.2)

X,Y = np.meshgrid(x,y)

Z = np.exp(-(X**2+Y**2)/80**2)*np.cos(np.sqrt(X**2+Y**2)/20)**2

print 'Elapsed time (ms):', (time.clock() - st)*1e3
plt.imshow(Z)


Elapsed time (ms): 297.011865079
Out[85]:
<matplotlib.image.AxesImage at 0x7cca898>

OK, 4 million points (2000 x 2000) takes a little while to process. Let's save that as a csv, then save it by pickling, and then try to read them both back in:


In [86]:
#save csv
st = time.clock()
import csv
fn_csv = './code/io/csv_example/big_array.csv'
with open(fn_csv,'wb') as csvfile:
    csv_writer = csv.writer(csvfile)
    for z_line in Z:
        csv_writer.writerow(z_line)
print 'How long did that take? (s)', time.clock() - st
# how big is this file..?
import os
print 'File size (MB):', os.path.getsize(fn_csv)/2**20


How long did that take? (s) 5.22988936046
File size (MB): 80

Quite a large file...

Let's try pickle instead...


In [87]:
#now pickle it instead
import cPickle as pickle

fn_pkl = './code/io/csv_example/big_array.pkl' # note you can have whatever extension you want here, 
                                           # or none at all, but I prefer a sensible extension

# pickle it
st = time.clock()
pickle.dump(Z,open(fn_pkl,'wb'))

print 'And this time... (s)', time.clock() - st
print 'File size (MB):', os.path.getsize(fn_pkl)/2**20


And this time... (s) 2.53754202893
File size (MB): 80

So the file sizes are the same, but writing takes roughly twice as long with the csv writer. Now let's try reading them back in. Here's a fairly generic function for reading csv files into data arrays:


In [88]:
def read_file_data(filename):
    """Read a comma-delimited file into a list of columns of floats."""
    with open(filename, 'U') as f:   # 'U' = universal-newline mode
        DataIn = csv.reader(f, delimiter=',')

        # use the first row to find the number of columns
        FirstLine = DataIn.next()
        NCols = len(FirstLine)
        DataOut = [[] for i in range(NCols)]
        print NCols, len(DataOut)

        def store(row):
            # append each value to its column, converted to a float
            for i in range(NCols):
                try:
                    DataOut[i].append(float(row[i]))
                except ValueError:   # if not numeric
                    DataOut[i].append(0)

        store(FirstLine)             # don't drop the first row of data
        for row in DataIn:
            store(row)
    return DataOut
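As an aside, for a purely numeric file like this one, numpy.loadtxt (covered earlier) can do the same job in one line; a sketch:


In [ ]:
# sketch: the numpy equivalent for numeric files; unpack=True transposes
# the result so that each element of cols is one column of the file
cols = np.loadtxt(fn_csv, delimiter=',', unpack=True)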

In [89]:
#read in csv file
st = time.clock()
Z = read_file_data(fn_csv)
print 'Elapsed time (ms):', (time.clock() - st)*1e3


2000 2000
Elapsed time (ms): 3473.91947552

Now compare with reading in from the pickled file:


In [90]:
st = time.clock()
Z = pickle.load(open(fn_pkl,'rb'))
print 'Elapsed time (ms):', (time.clock() - st)*1e3


Elapsed time (ms): 2868.40562932

This might not seem like much of a speed-up, but as file sizes grow it becomes a sizeable performance gain.

However, the pickle module is most useful when storing many data types, as there is no need to format the data before saving:


In [92]:
# pickle example for mixed data types

# numpy 1d array
x = np.arange(-100,100,0.01)
# numpy 2d array
y = np.ones((500,500))
# list of mixed type
z = [1,4,6,0,'abcde']
# string
a = 'this is a string'
# tuple
b = (42,'anything')

Now let's say we want to save all of this data. Instead of writing many files, it can all be bundled into one pickled file:


In [93]:
fn_pkl = './code/io/csv_example/multi_out.pkl'
#save
pickle.dump([x,y,z,a,b],open(fn_pkl,'wb'))

And then to read it back in:


In [97]:
x2,y2,z2,a2,b2 = pickle.load(open(fn_pkl,'rb'))

# check the data is the same as what we put in... (x2==x and y2==y would
# return element-wise boolean arrays, so we compare the remaining items)
print 'Data arrays same?\n', z2==z, a2==a, b2==b


Data arrays same?
True True True
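A variation worth knowing (not part of the original example): pickling a dictionary instead of a list lets you retrieve each item by name rather than by position:


In [ ]:
# sketch: bundle the same data into a dict, so items are looked up by name
saved = {'x': x, 'y': y, 'z': z, 'a': a, 'b': b}
pickle.dump(saved, open(fn_pkl, 'wb'))

loaded = pickle.load(open(fn_pkl, 'rb'))
print loaded['a']   # 'this is a string'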
