Based on Jake Vanderplas' intro to NumPy lesson for I/O with NumPy.
Instructions: Create a new directory called DataIO with a new notebook called DataIOTour. Give it a heading 1 cell title Tour of Data I/O. Read this page, typing in the code in the code cells and executing them as you go.
Do not copy/paste.
Type the commands yourself to get practice doing it. This will also slow you down so you can think about the commands and what they are doing as you type them.
Save your notebook when you are done, then try the accompanying exercises.
First off, "I/O" stands for "input/output". Being able to read and write data to and from files is a critical part of programming, particularly for scientific applications. How you read and write data can depend on what that data is. For example, you may want to read in a simple Excel spreadsheet of data, stored as a "comma-separated-values" or csv file. Alternatively, you may want to read in the data for an image stored as a jpeg file.
In this tour, we'll explore a few ways to work with data files that will depend on the kind of data and how it is stored on the computer.
Sometimes you want to write text to a file. That is your "data". Let's see how we can read/write string data. Since anything can be converted to a string and many strings can be converted to numbers, this can be very useful. First, create a file for us to read:
In [ ]:
%%writefile inout.dat
Here is a nice file
with a couple lines of text
it is a haiku
Read the whole file all at once:
In [ ]:
f = open('inout.dat')
print(f.read())
f.close()
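As a side note, the modern idiomatic way to guarantee a file gets closed is a with block (a context manager), which closes the file automatically even if an error occurs. Here is a small sketch (it recreates the sample file first so the cell runs on its own):

```python
# Recreate the sample file so this cell stands alone
with open('inout.dat', 'w') as f:
    f.write('Here is a nice file\nwith a couple lines of text\nit is a haiku\n')

# The with-block closes the file automatically when the block ends,
# even if an error is raised inside it -- no explicit f.close() needed.
with open('inout.dat') as f:
    contents = f.read()
print(contents)
```

You will see both styles in the wild; the explicit open/close form used in this tour makes each step visible, which is useful while learning.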
Read the file line by line, saving each line as a separate string in a python list:
In [ ]:
f = open('inout.dat')
print(f.readlines())
f.close()
Notice the "\n" in the first couple of entries. Those are the linebreaks.
Here's another way to read line by line that splits the text in each line into separate list elements:
In [ ]:
for line in open('inout.dat'):
    print(line.split())
write() is the opposite of read(). When we open a file for writing, we tell it the mode 'w', which means "write". We could have used the mode 'r' for the other open commands, but that is the default mode unless otherwise specified, as in this case:
In [ ]:
contents = open('inout.dat').read()
out = open('my_output.dat', 'w')
out.write(contents.replace(' ', '_'))
out.close()
In [ ]:
!cat my_output.dat
In [ ]:
# writelines() is the opposite of readlines()
lines = open('inout.dat').readlines()
out = open('my_output.dat', 'w')
out.writelines(lines)
out.close()
In [ ]:
!cat my_output.dat
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
NumPy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a text and a binary file containing numerical data. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, in this case with 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable pi had in the computer's memory.
The tradeoffs between the two modes are thus: text files are human-readable and portable between systems, but they take more space, are slower to read and write, and can lose precision if too few digits are written. Binary files are compact, fast, and preserve the data exactly, but they are not human-readable and you need to know (or record) how they were written in order to read them back.
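To make the difference concrete, here is a small sketch comparing the two representations of the same number (the exact byte values you see depend on your machine):

```python
import numpy as np

pi = np.pi
text_form = repr(pi)                     # human-readable decimal digits
binary_form = np.float64(pi).tobytes()   # the raw 8 bytes from memory

print(text_form, '->', len(text_form), 'characters of text')
print(binary_form, '->', len(binary_form), 'bytes of binary data')
```

Note that np.frombuffer can turn those 8 bytes back into the exact same float, with no rounding at all.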
First, let's see how to read and write arrays in text mode. The np.savetxt function saves an array to a text file, with options to control the precision, the separators, and even adding a header:
In [ ]:
arr = np.arange(10).reshape(2, 5)
np.savetxt('test.out', arr, fmt='%.2e', header="My dataset")
!cat test.out
The fmt keyword lets you set the format for the values that are written to the file. More on that in a second.

Lines that start with a hash symbol (#) are treated as comments and ignored when the file is read back. Blank lines are also ignored. It is a good idea to put a header - lines with explanatory comments at the beginning of data files - because you will quickly forget what the numbers mean. Here the header is short and trivial, but it doesn't have to be. You could have a longer string (several paragraphs of text with newlines "\n" in it, for example) stored as a variable that you pass to the header keyword of savetxt. Giving files descriptive names is also helpful.
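For instance, a longer header can be built up in a string variable first (the file name and header text here are just an illustration); savetxt prefixes every line of it with "# " automatically:

```python
import numpy as np

arr = np.arange(10).reshape(2, 5)

# A multi-line description stored in a variable
long_header = ("Sample dataset for the I/O tour\n"
               "Two rows of five arbitrary values each\n"
               "Generated with np.arange(10).reshape(2, 5)")

np.savetxt('test_long_header.out', arr, fmt='%.2e', header=long_header)
print(open('test_long_header.out').read())
```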
And this same type of file can then be read with the matching np.loadtxt function:
In [ ]:
DataIn = np.loadtxt('test.out')
print(DataIn.shape)
print(DataIn)
You can see that DataIn is a 2-dimensional array with 2 rows of 5 numbers each. You can work with the rows and columns by slicing the array:
In [ ]:
print(DataIn[1,:])
You can also extract each column of numbers into a separate 1-dimensional array. Setting the argument unpack=True and providing a variable for each column accomplishes this.
In [ ]:
a, b, c, d, x = np.loadtxt('test.out', unpack=True)
print(a)
print(b)
print(c)
print(d)
print(x)
If you want to read in only some columns, you can use the usecols argument to specify which ones. (Recall that indices in Python start from zero, not one.) The line below will read only the first and second columns of data, so only two variable names are provided.
In [ ]:
a, b = np.loadtxt('test.out', unpack=True, usecols=[0, 1])
print(a)
print(b)
Oftentimes we have a file that has been written as a "comma-separated-values" or csv file. Excel can read/write csv files, and many experimental devices controlled by computers can write collected data as csv files. You will encounter these a lot in PHYS 340. To read in data from a csv file, we just have to let loadtxt know that the fields in our data file are separated by commas. We pass an extra argument that defines the field delimiter.
In [ ]:
%%writefile input.csv
# My csv example data
0.0, 1.1, 0.1
2.0, 1.9, 0.2
4.0, 3.2, 0.1
6.0, 4.0, 0.3
8.0, 5.9, 0.3
In [ ]:
!cat input.csv
In [ ]:
# throws an error because commas are not part of floating-point numbers
x, y, z = np.loadtxt('input.csv', unpack=True, usecols=[0, 1, 2])
In [ ]:
x, y = np.loadtxt('input.csv', unpack=True, delimiter=',', usecols=[0, 1])
print(x, y)
Note: np.genfromtxt is like np.loadtxt, but it can handle missing data. It sets all missing values to np.nan. This can be useful when your data has gaps in it: you can filter the np.nans out of the array you read in without losing anything else.
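For example, if a row in the file has a missing field, np.genfromtxt fills it with np.nan, which you can then mask out (the file contents here are made up for illustration):

```python
import numpy as np

# A made-up csv file with a missing value in the second row
with open('gappy.csv', 'w') as f:
    f.write("0.0, 1.1\n2.0, \n4.0, 3.2\n")

data = np.genfromtxt('gappy.csv', delimiter=',')
print(data)   # the gap shows up as nan

# Keep only the rows that contain no nan
clean = data[~np.isnan(data).any(axis=1)]
print(clean)
```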
Suppose that you’ve read two columns of data into the array t for time and the array v for the voltage from a pressure sensor. Here are the values you read in:
In [ ]:
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
v = np.array([0.137,0.456,0.591,0.713,0.859,0.926,1.139,1.327,1.512,1.875])
Also, suppose that the manual for the sensor gives the following equation to find the pressure in atmospheres from the voltage reading.
$p = 0.15 + v/10.0$
You can write a single command to create the p array from the v array.
In [ ]:
p = 0.15 + v/10.0
Once you’ve calculated the pressures, you might want to write the times and pressures to a text file for later use. The following command will write t and p to the file “output.dat”. The file will be saved in the same directory as the notebook. If you give the name of an existing file, it will be overwritten, so be careful!
In [ ]:
np.savetxt('output.dat', (t,p))
Unfortunately, each of the arrays will appear in a different row, which is not very human-readable, and is inconvenient for large data sets.
The column_stack function can be used to put each array into a different column of the output. The argument should be a sequence of arrays (the inner pair of parentheses makes it a tuple) in the order that you want them to appear. The column_stack function stacks each of the arrays into a column of an array called DataOut, which is written to the text file.
In [ ]:
DataOut = np.column_stack((t,p))
np.savetxt('output.dat', DataOut)
In [ ]:
!cat output.dat
By default, the numbers will be written in scientific notation. The fmt argument can be used to specify the formatting. If one format is supplied, it will be used for all of the numbers. The general form of the fmt argument is

fmt = '%(width).(precision)(specifier)'

where width specifies the minimum number of characters in the field, precision specifies the number of digits after the decimal point, and the possibilities for specifier are shown below. For integer formatting, the precision argument is ignored if you give it. For scientific notation and floating-point formatting, the width argument is optional.
Specifier | Meaning | Example Format | Output for -34.5678
---|---|---|---
i | signed integer | %5i | -34
e | scientific notation | %5.4e | -3.4568e+01
f | floating point | %5.2f | -34.57
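You can experiment with these format strings directly using Python's % operator before handing them to savetxt:

```python
# Try each format from the table on the sample value
value = -34.5678
print('%5i' % value)     # integer: precision dropped, field padded to width 5
print('%5.4e' % value)   # scientific notation with 4 digits after the point
print('%5.2f' % value)   # floating point with 2 digits after the point
```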
A format can also be provided for each column (two in this case) as follows.
In [ ]:
np.savetxt('output.dat', DataOut, fmt=('%3i', '%4.3f'))
In [ ]:
!cat output.dat
Let's add a header comment to this file so we don't forget what we wrote in it:
In [ ]:
myheader = "\nTime and Pressure data\nt (s)   p (atm)\n"
np.savetxt('output.dat', DataOut, fmt=('%3i', '%4.3f'), header=myheader)
!cat output.dat
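When you read this file back with np.loadtxt, the header lines beginning with # are skipped automatically. A self-contained sketch (it recreates the data and file first so the cell runs on its own):

```python
import numpy as np

# Recreate the time/pressure data and the output file from above
t = np.arange(10)
v = np.array([0.137, 0.456, 0.591, 0.713, 0.859,
              0.926, 1.139, 1.327, 1.512, 1.875])
p = 0.15 + v / 10.0
myheader = "\nTime and Pressure data\nt (s)   p (atm)\n"
np.savetxt('output.dat', np.column_stack((t, p)),
           fmt=('%3i', '%4.3f'), header=myheader)

# loadtxt ignores the '#' header lines on the way back in
t_in, p_in = np.loadtxt('output.dat', unpack=True)
print(t_in)
print(p_in)
```

Notice that p_in carries only the 3 decimal digits we wrote with '%4.3f', not the full precision of the original p array.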
For binary data, NumPy provides the np.save and np.savez routines. The first saves a single array to a file with the .npy extension, while the latter can be used to save a group of arrays into a single file with the .npz extension. The files created with these routines can then be read with the np.load function.

Let us first see how to use the simpler np.save function to save a single array:
In [ ]:
arr2 = DataIn  # reuse the array from before
np.save('test.npy', arr2)

# Now we read this back
arr2n = np.load('test.npy')

# Let's see if any element is non-zero in the difference.
# A value of True would be a problem.
print('Any differences?', np.any(arr2 - arr2n))
Now let us see how the np.savez function works. You give it a filename and either a sequence of arrays or a set of keywords. In the first mode, the function will automatically name the saved arrays in the archive arr_0, arr_1, etc.:
In [ ]:
np.savez('test.npz', arr, arr2)
arrays = np.load('test.npz')
arrays.files
Alternatively, we can explicitly choose how to name the arrays we save:
In [ ]:
np.savez('test.npz', array1=arr, array2=arr2)
arrays = np.load('test.npz')
arrays.files
The object returned by np.load from an .npz file works like a dictionary, though you can also access its constituent arrays by attribute using its special .f field; this is best illustrated with an example using the arrays object from above:
In [ ]:
print('First row of first array:', arrays['array1'][0])
# This is an equivalent way to get the same array
print('First row of first array:', arrays.f.array1[0])
This .npz format is a very convenient way to package a group of related arrays compactly, and without loss of information, into a single file. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.
Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete discussion of the possibilities is beyond the scope of this lesson, but of particular interest for scientific users, we at least mention that the scipy.io module contains routines to read and write Matlab files in .mat format and files in the NetCDF format that is widely used in certain scientific disciplines.
We'll do a separate lesson on images in another notebook.