Data I/O

Based on Jake Vanderplas' intro to NumPy lesson for I/O with NumPy.


Instructions: Create a new directory called DataIO with a new notebook called DataIOTour. Give it a heading 1 cell title Tour of Data I/O. Read this page, typing in the code in the code cells and executing them as you go.

Do not copy/paste.

Type the commands yourself to get the practice doing it. This will also slow you down so you can think about the commands and what they are doing as you type them.

Save your notebook when you are done, then try the accompanying exercises.


Introduction

First off, "I/O" stands for "input/output". Being able to read and write data to and from files is a critical part of programming, particularly for scientific applications. How you read and write data can depend on what that data is. For example, you may want to read in a simple Excel spreadsheet of data, stored as a "comma-separated-values" or csv file. Alternatively you may want to read in the data for an image stored as a jpeg file.

In this tour, we'll explore a few ways to work with data files that will depend on the kind of data and how it is stored on the computer.

Reading/writing text

Sometimes you want to write text to a file. That is your "data". Let's see how we can read/write string data. Since anything can be converted to a string and many strings can be converted to numbers, this can be very useful. First, create a file for us to read:


In [ ]:
%%file inout.dat
Here is a nice file
with a couple lines of text
it is a haiku

Read the whole file all at once:


In [ ]:
f = open('inout.dat')
print(f.read())
f.close()

Read the file line by line, saving each line as a separate string in a python list:


In [ ]:
f = open('inout.dat')
print(f.readlines())
f.close()

Notice the "\n" in the first couple of entries. Those are the linebreaks.

Here's another way to read line by line that splits the text in each line into separate list elements:


In [ ]:
for line in open('inout.dat'):
    print(line.split())

write() is the opposite of read(). To open a file for writing, we pass the mode 'w', which means "write". We could have passed the mode 'r' ("read") to the earlier open commands, but that is the default when no mode is given, as in the first line below:


In [ ]:
contents = open('inout.dat').read()
out = open('my_output.dat', 'w')
out.write(contents.replace(' ', '_'))
out.close()

In [ ]:
!cat my_output.dat

In [ ]:
# writelines() is the opposite of readlines()
lines = open('inout.dat').readlines()
out = open('my_output.dat', 'w')
out.writelines(lines)
out.close()

In [ ]:
!cat my_output.dat
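
The cells above call close() by hand. A common alternative is Python's with statement, which closes the file automatically when the indented block ends, even if an error occurs partway through. The sketch below repeats the read and write steps in that style. The file names haiku_demo.dat and haiku_demo_out.dat are made up for illustration, and the first step creates the input file so that the example runs on its own.

```python
# Create a small input file so this example is self-contained.
# (haiku_demo.dat is a hypothetical file name.)
with open('haiku_demo.dat', 'w') as f:
    f.write('Here is a nice file\nwith a couple lines of text\nit is a haiku\n')

# Read every line; the file is closed when the with block ends.
with open('haiku_demo.dat') as f:
    lines = f.readlines()

# Write a modified copy, replacing spaces with underscores.
with open('haiku_demo_out.dat', 'w') as out:
    out.writelines(line.replace(' ', '_') for line in lines)

# Read the copy back to check what was written.
with open('haiku_demo_out.dat') as f:
    result = f.read()
print(result)
```

Because each file is tied to its own with block, there is no close() call to forget.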

NumPy data I/O


In [ ]:
%pylab inline
import numpy as np
import matplotlib.pyplot as plt

NumPy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a text and a binary file containing numerical data. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, in this case with 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable pi had in the computer's memory.

The tradeoffs between the two modes are thus:

  • Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can only be used for one- and two-dimensional arrays.
  • Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.
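
To make the tradeoff concrete, here is a minimal sketch that looks at both representations of pi without writing any files. It assumes the default 64-bit floating point type; the variable names are only for illustration.

```python
import numpy as np

# Text form: the digits a human would read.
text_form = repr(float(np.pi))

# Binary form: the raw bytes of a 64-bit float, as held in memory.
binary_form = np.float64(np.pi).tobytes()

print(text_form, len(text_form))   # the string and its length in characters
print(len(binary_form))            # the number of bytes

# Converting the bytes back recovers the original value.
recovered = np.frombuffer(binary_form, dtype=np.float64)[0]
print(recovered == np.pi)
```

The text form needs roughly twice as many characters as the binary form needs bytes, and writing fewer digits to save space would change the value that is read back.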

Text Data

First, let's see how to read and write arrays in text mode. The np.savetxt function saves an array to a text file, with options to control the precision, separators and even adding a header:


In [ ]:
arr = np.arange(10).reshape(2, 5)
np.savetxt('test.out', arr, fmt='%.2e', header="My dataset")
!cat test.out

The fmt keyword lets you set the format for the values that are written to the file. More on that in a second.

Lines that start with a hash sign (#) are treated as comments and ignored when the file is read back with np.loadtxt. Blank lines are also ignored. It is a good idea to put a header - lines with explanatory comments at the beginning of data files - because you will quickly forget what the numbers mean. Here the header is short and trivial, but it doesn't have to be. You could have a longer string (several paragraphs of text with newlines "\n" in it, for example) stored as a variable that you pass to the header keyword of savetxt. Giving files descriptive names is also helpful.
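
As a small sketch of a multi-line header, the example below stores the header text in a variable and then prints the lines of the resulting file. The file name header_demo.out and the header wording are made up for illustration.

```python
import numpy as np

# A header with two lines, separated by a newline character.
header_text = "Demo dataset\ncolumns: five integers per row"

demo = np.arange(10).reshape(2, 5)
np.savetxt('header_demo.out', demo, fmt='%d', header=header_text)

# Look at the file contents line by line.
with open('header_demo.out') as f:
    file_lines = f.read().splitlines()

for line in file_lines:
    print(line)
```

Each header line is written with a leading "# ", so the whole header counts as comments.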

And this same type of file can then be read with the matching np.loadtxt function:


In [ ]:
DataIn = np.loadtxt('test.out')
print(DataIn.shape)
print(DataIn)

You can see that DataIn is a 2-dimensional array with 2 rows, each containing 5 numbers. You can work with individual rows and columns by slicing the array:


In [ ]:
print(DataIn[1,:])
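
The same slicing syntax selects columns and single values. The sketch below rebuilds an array with the same shape and values as the one saved above (named data here, for illustration) so that it runs without the file.

```python
import numpy as np

# Same shape and values as the array saved to test.out above.
data = np.arange(10, dtype=float).reshape(2, 5)

second_row = data[1, :]      # all columns of row 1
first_column = data[:, 0]    # all rows of column 0
single_value = data[1, 3]    # row 1, column 3

print(second_row)
print(first_column)
print(single_value)
```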

You can also extract each column of numbers into a separate 1-dimensional array. Setting the argument unpack=True and providing a variable for each column accomplishes this.


In [ ]:
a, b, c, d, x = np.loadtxt('test.out', unpack=True)
print(a)
print(b)
print(c)
print(d)
print(x)

If you want to read in only some columns, you can use the usecols argument to specify which ones. (Recall that indices in Python start from zero, not one.) The line below will read only the first and second columns of data, so only two variable names are provided.


In [ ]:
a, b = np.loadtxt('test.out',unpack=True, usecols=[0,1])
print(a)
print(b)

Dealing with csv files

Oftentimes we have a file that has been written as a "comma-separated-values" or csv file. Excel can read/write csv files and many experimental devices controlled by computers can write collected data as csv files. You will encounter these a lot in PHYS 340. To read in data from a csv file, we just have to let loadtxt know that the fields in our data file are separated by commas. We pass an extra argument that defines the field delimiter.


In [ ]:
%%file input.csv
# My csv example data
    0.0,  1.1,  0.1
    2.0,  1.9,  0.2
    4.0,  3.2,  0.1
    6.0,  4.0,  0.3
    8.0,  5.9,  0.3

In [ ]:
!cat input.csv

In [ ]:
# Throws an error: without delimiter=',', loadtxt splits on whitespace
# and cannot convert strings such as '0.0,' to floating point numbers
x, y = np.loadtxt('input.csv',unpack=True, usecols=[0,1])

In [ ]:
x, y = np.loadtxt('input.csv',unpack=True, delimiter=',', usecols=[0,1])
print(x, y)

Note: np.genfromtxt is like np.loadtxt, but it can handle missing data. By default it sets missing values to np.nan, which is useful when your data has gaps in it. You can then filter the np.nan entries out of the array without losing any of the valid data.
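
Here is a minimal sketch of that workflow. The csv text is invented for illustration and is read from a string instead of a file, so the example is self-contained; the second row is missing its last value.

```python
import io
import numpy as np

# Hypothetical csv text; the second row has a gap after the comma.
csv_text = "0.0,1.1\n2.0,\n4.0,3.2\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(data)

# Keep only the rows that contain no np.nan entries.
complete_rows = data[~np.isnan(data).any(axis=1)]
print(complete_rows)
```

Dropping whole rows is only one option; depending on the analysis, you might instead keep the rows and skip the np.nan entries column by column.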

More on writing data

Suppose that you’ve read two columns of data into the arrays t for time and v for the voltage from a pressure sensor. Here are the values you read in:


In [ ]:
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
v = np.array([0.137,0.456,0.591,0.713,0.859,0.926,1.139,1.327,1.512,1.875])

Also, suppose that the manual for the sensor gives the following equation to find the pressure in atmospheres from the voltage reading.

$p = 0.15 + v/10.0$

You can write a single command to create the p array from the v array.


In [ ]:
p = 0.15 + v/10.0

Once you’ve calculated the pressures, you might want to write the times and pressures to a text file for later use. The following command will write t and p to the file “output.dat”. The file will be saved in the same directory as the program. If you give the name of an existing file, it will be overwritten, so be careful!


In [ ]:
np.savetxt('output.dat', (t,p))

Unfortunately, each of the arrays will appear in a different row, which is not very human-readable, and is inconvenient for large data sets.

The column_stack function can be used to put each array into a different column. Its argument should be a sequence of arrays (the inner pair of parentheses makes it a tuple) in the order that you want them to appear. The column_stack function stacks each of the arrays into a column of an array called DataOut, which is written to the text file.


In [ ]:
DataOut = np.column_stack((t,p))
np.savetxt('output.dat', DataOut)

In [ ]:
!cat output.dat
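
To see why the two approaches give different layouts, it can help to compare array shapes directly. This sketch uses short stand-in arrays (t_demo and p_demo are illustrative names, not the data above).

```python
import numpy as np

t_demo = np.array([0, 1, 2])
p_demo = np.array([0.16, 0.20, 0.21])

# A tuple of arrays becomes one row per array.
as_rows = np.array((t_demo, p_demo))

# column_stack makes one column per array instead.
as_columns = np.column_stack((t_demo, p_demo))

print(as_rows.shape)
print(as_columns.shape)

# The two layouts are transposes of each other.
print(np.array_equal(as_columns, as_rows.T))
```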

By default, the numbers will be written in scientific notation. The fmt argument can be used to specify the formatting. If one format is supplied, it will be used for all of the numbers. The general form of the fmt argument is

fmt = '%(width).(precision)(specifier)'

where width specifies the minimum total number of characters (shorter output is padded with spaces), precision specifies the number of digits after the decimal point, and the possibilities for specifier are shown below. For integer formatting, a precision, if given, sets the minimum number of digits (padded with leading zeros) rather than a number of decimal places. The width is optional for every specifier.

Specifier   Meaning               Example format   Output for -34.5678
i           signed integer        %5i              -34
e           scientific notation   %5.4e            -3.4568e+01
f           floating point        %5.2f            -34.57
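
The same format strings work with Python's % string operator, so you can try them out on a single number before using them in savetxt. The sketch below prints each result with repr() so that any padding spaces are visible; the variable names are only for illustration.

```python
value = -34.5678

as_integer = '%5i' % value        # truncated to an integer, padded to 5 characters
as_scientific = '%5.4e' % value   # 4 digits after the decimal point
as_float = '%5.2f' % value        # 2 digits after the decimal point
as_wide_float = '%10.2f' % value  # same, but padded to 10 characters

for text in (as_integer, as_scientific, as_float, as_wide_float):
    print(repr(text))
```

Note that the width never cuts a number short: when the formatted value needs more characters than the width, the full value is still written.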

A format can also be provided for each column (two in this case) as follows.


In [ ]:
np.savetxt('output.dat', DataOut, fmt=('%3i', '%4.3f'))

In [ ]:
!cat output.dat

Let's add a header comment to this file so we don't forget what we wrote in it:


In [ ]:
myheader = "\nTime and Pressure data\nt (s) p (atm)\n"
np.savetxt('output.dat', DataOut, fmt=('%3i', '%4.3f'), header=myheader)
!cat output.dat
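
Keep in mind that a format with only a few decimal places rounds the data on disk. The sketch below writes three pressures with three decimals and reads them back; the file name precision_demo.dat and the array names are made up for illustration.

```python
import numpy as np

# Three pressures computed as in the text: 0.1637, 0.1956, 0.2091
p_demo = 0.15 + np.array([0.137, 0.456, 0.591]) / 10.0

np.savetxt('precision_demo.dat', p_demo, fmt='%4.3f',
           header='p (atm), rounded to 3 decimals')
p_loaded = np.loadtxt('precision_demo.dat')

print(p_loaded)
print(np.array_equal(p_loaded, p_demo))          # exact comparison
print(np.allclose(p_loaded, p_demo, atol=1e-3))  # comparison within 0.001
```

The values read back agree with the originals only to the number of digits that were written, which is the precision loss described earlier for text files.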

Binary data

For binary data, NumPy provides the np.save and np.savez routines. The first saves a single array to a file with .npy extension, while the latter can be used to save a group of arrays into a single file with .npz extension. The files created with these routines can then be read with the np.load function.

Let us first see how to use the simpler np.save function to save a single array:


In [ ]:
arr2 = DataIn  # a second name for the array read in earlier (not a copy)
np.save('test.npy', arr2)
# Now we read this back
arr2n = np.load('test.npy')
# Let's see if any element is non-zero in the difference.
# A value of True would be a problem.
print('Any differences?', np.any(arr2-arr2n))
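
Binary files are also not limited to two dimensions. As a sketch, the example below saves a 3-dimensional array and checks the round trip with np.array_equal, which is another way to compare two arrays element by element. The names cube and cube_demo.npy are made up for illustration.

```python
import numpy as np

# A 3-dimensional array of 24 evenly spaced values.
cube = np.linspace(0.0, 1.0, 24).reshape(2, 3, 4)

np.save('cube_demo.npy', cube)
cube_loaded = np.load('cube_demo.npy')

# Shape and data type survive the round trip.
print(cube_loaded.shape, cube_loaded.dtype)

# True means every element matches exactly.
print(np.array_equal(cube, cube_loaded))
```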

Now let us see how the np.savez function works. You give it a filename and either a sequence of arrays or a set of keywords. In the first mode, the function will automatically name the saved arrays in the archive as arr_0, arr_1, etc:


In [ ]:
np.savez('test.npz', arr, arr2)
arrays = np.load('test.npz')
arrays.files

.npz: multiple binary outputs in one file

Alternatively, we can explicitly choose how to name the arrays we save:


In [ ]:
np.savez('test.npz', array1=arr, array2=arr2)
arrays = np.load('test.npz')
arrays.files

The object returned by np.load from an .npz file works like a dictionary, though you can also access its constituent files by attribute using its special .f field; this is best illustrated with an example with the arrays object from above:


In [ ]:
print('First row of first array:', arrays['array1'][0])
# This is an equivalent way to get the same field
print('First row of first array:', arrays.f.array1[0])
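
Since the loaded object lists its contents in .files, you can also loop over everything in an archive without knowing the names in advance. This is a minimal sketch with made-up names (archive_demo.npz, first, second); it opens the archive in a with block so that the underlying file is closed afterwards.

```python
import numpy as np

first = np.arange(10).reshape(2, 5)
second = np.linspace(0.0, 1.0, 5)
np.savez('archive_demo.npz', first=first, second=second)

# Copy every array out of the archive into an ordinary dictionary.
with np.load('archive_demo.npz') as archive:
    loaded = {name: archive[name] for name in archive.files}

for name, array in loaded.items():
    print(name, array.shape)
```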

This .npz format is a very convenient way to package compactly and without loss of information, into a single file, a group of related arrays that pertain to a specific problem. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.

Other data formats

Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete survey of the possibilities is beyond the scope of this tour. Of particular interest to scientific users, the scipy.io module contains routines to read and write Matlab files in .mat format, as well as files in the NetCDF format that is widely used in certain scientific disciplines.


Image data

We'll do a separate lesson on images in another notebook.


All content is under a modified MIT License, and can be freely used and adapted. See the full license text here.