by David Paredes (david.paredes@durham.ac.uk) and James Keaveney (james.keaveney@durham.ac.uk)
In Python, there are two ways of opening a data file: the built-in open function, or a dedicated function from a module. Usually we will deal with highly structured files, so the second method will be easier in most cases (for very big files this is no longer true).
In this tutorial, we will overview two modules/functions, numpy.loadtxt and csv, that allow you to get data from comma-separated-value (csv) files, where, as the name suggests, the different data values reside in a file separated by commas, spaces or other delimiters.
The function numpy.loadtxt, also present as scipy.loadtxt in the scipy package, is very simple to use. You just pass the name of the file as a string (that is, enclosed in single or double quotes) and the output is an array with the contents of the file.
For example, let's load the information in the file simpleDataset.csv and store it in the variable myDataset.
In [75]:
import numpy #Remember to import the module
myDataset = numpy.loadtxt("./code/io/csv_example/simpleDataset.csv")
print myDataset
The text file contains two columns of numbers separated by one space. By default, the function takes as a delimiter any white space.
If, for example, the values were separated by commas (as in simpleDatasetComma.txt), we would need to specify the keyword argument delimiter, which is a string containing the delimiting character(s).
In [76]:
myDatasetComma = numpy.loadtxt("./code/io/csv_example/simpleDatasetComma.txt", delimiter=',')
print myDatasetComma
Some files contain headers that are not part of the data (see fileWithHeader.csv). This file starts with 3 lines that give information about the data, but are not data themselves. The header can be skipped with the keyword argument skiprows.
In [77]:
complicatedDataset = numpy.loadtxt("./code/io/csv_example/fileWithHeader.csv",delimiter=',', skiprows=3)
print complicatedDataset
You can also select which columns to extract from the file with the keyword argument usecols.
In [78]:
justSomeCols = numpy.loadtxt("./code/io/csv_example/fileWithHeader.csv",
delimiter=',', skiprows=3, usecols=(1,3))
print justSomeCols
You can get more information about numpy.loadtxt in the scipy documentation, or by using the help function, as described in the page Basics - Help and information.
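numpy.loadtxt has a few more keyword arguments worth knowing about. As a minimal sketch (the data below is invented for illustration, and a file-like object stands in for a file on disk), unpack=True transposes the output so each column lands in its own variable, and lines starting with the comments character (by default '#') are skipped automatically:

```python
import io
import numpy as np

# A small in-memory "file"; a real file name would work the same way.
text = io.StringIO("# current, voltage\n0.1,1.0\n0.2,2.1\n0.3,2.9\n")

# unpack=True returns one array per column;
# the '#' line is treated as a comment and skipped.
current, voltage = np.loadtxt(text, delimiter=',', unpack=True)

print(current)   # first column
print(voltage)   # second column
```

This avoids having to index into the 2-D array with usecols or slicing when all you want is the individual columns.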
The module csv has a syntax very similar to that of loadtxt, but is much more powerful in the sense that it can handle all sorts of data types. The official documentation states:
The lack of a standard (for csv files) means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources.
This module provides classes to read and write tabular data from/to different formats. The most basic example of reading with this module would be:
In [79]:
import csv
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
importantReader = csv.reader(ultraImportantFile, delimiter=' ')
for row in importantReader:
print row
As you can see (look at the quotation marks), the reader stores the data as lists of strings split on the delimiter. This is the default, since csv files can contain heterogeneous data. It is possible to use the function float to convert the data to floats (for example), like:
In [80]:
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
data = list(csv.reader(ultraImportantFile, delimiter=' '))
print float(data[0][1])
It is also possible to give meaningful names to the columns. For example, if the two columns in the file represent "current" and "voltage", we can use the DictReader class:
In [81]:
currentData = []
voltData = []
cols = ['current', 'voltage']
with open("./code/io/csv_example/simpleDatasetComma.txt", 'rb') as csvfile:
for row in csv.DictReader(csvfile, fieldnames=cols, delimiter=','):
# Convert non-string data here e.g.:
thiscurrent = float(row['current'])
thisvoltage = float(row['voltage'])
currentData.append(thiscurrent)
voltData.append(thisvoltage)
print currentData, voltData
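If the file's first line already names the columns, DictReader can pick the field names up from there instead of having them passed in: with fieldnames omitted, the first row is used as the header. A minimal sketch (the file contents below are invented for illustration, with an in-memory "file" standing in for one on disk):

```python
import csv
import io

# Pretend file whose first row is a header line
contents = "current,voltage\n0.1,1.0\n0.2,2.1\n"

# fieldnames omitted, so DictReader takes them from the first row
rows = list(csv.DictReader(io.StringIO(contents), delimiter=','))
for row in rows:
    print(float(row['current']), float(row['voltage']))
```

This way the code doesn't need to know the column names in advance, only which ones it wants to use.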
So far we've looked at reading files in, but occasionally it's useful to save processed data as well. If the data needs to be human-readable, or read again outside of the code that wrote it, then outputting a csv file is useful. If not (for example, a calculation takes a long time and you want to save the result so that the next time the program runs it just looks the result up instead of re-calculating), then pickled (binary) data is the easiest format to use. We will give examples of both below.
In [82]:
import numpy as np
#generate random data
x = np.arange(-10,10,0.01)
y = np.sin(3*x**2)*np.cos(x)**2
# print the first few lines
print x[0:10], y[0:10]
Now let's write this into a two-column csv file. We write row-by-row, so first we need to pair the data up into rows. We do this using the built-in zip function:
In [83]:
xy = zip(x,y)
# look at the first few lines of this - note the format is different to the previous block
print xy[0:10]
In [84]:
import csv
filename = './code/io/csv_example/csv_write_example.csv'
with open(filename, 'wb') as csvfile:
csv_writer = csv.writer(csvfile,delimiter=',')
# if header lines are required, they can be written here
header_line = ('Time (ms)', 'Voltage (V)')
csv_writer.writerow(header_line)
# write main block of data
for xy_line in xy:
csv_writer.writerow(xy_line)
That's it. More columns can be added simply by zipping more things together, e.g. zip(x,y,z,...). If you want to look at the csv file we just generated, it's at ./code/io/csv_example/csv_write_example.csv.
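As an aside, numpy can write the whole table in one call with numpy.savetxt, which takes a delimiter and an optional header line. A sketch of the same two-column write, using a temporary file so as not to assume the tutorial's directory exists:

```python
import os
import tempfile
import numpy as np

x = np.arange(-10, 10, 0.01)
y = np.sin(3 * x**2) * np.cos(x)**2

# column_stack builds the same two-column layout as zip(x, y)
xy = np.column_stack((x, y))

fn = os.path.join(tempfile.mkdtemp(), 'savetxt_example.csv')
# header= writes a first line; comments='' stops numpy prefixing it with '#'
np.savetxt(fn, xy, delimiter=',', header='Time (ms),Voltage (V)', comments='')

# read it straight back, skipping the header line
back = np.loadtxt(fn, delimiter=',', skiprows=1)
print(back.shape)
```

For purely numeric tables this replaces the explicit writerow loop entirely; the csv module remains the better choice when rows mix numbers and text.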
In [85]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import time
st = time.clock()
# make large arrays
x = np.arange(-100,100,0.1)
y = np.arange(-200,200,0.2)
X,Y = np.meshgrid(x,y)
Z = np.exp(-(X**2+Y**2)/80**2)*np.cos(np.sqrt(X**2+Y**2)/20)**2
print 'Elapsed time (ms):', (time.clock() - st)*1e3
plt.imshow(Z)
Out[85]:
[image output: 2-D intensity plot of the array Z]
Ok, 4 million points (2k x 2k) takes a little while to process. Let's save that as a csv, then save it by pickling, and then try reading them back in:
In [86]:
#save csv
st = time.clock()
import csv
fn_csv = './code/io/csv_example/big_array.csv'
with open(fn_csv,'wb') as csvfile:
csv_writer = csv.writer(csvfile)
for z_line in Z:
csv_writer.writerow(z_line)
print 'How long did that take? (s)', time.clock() - st
# how big is this file..?
import os
print 'File size (MB):', os.path.getsize(fn_csv)/2**20
Quite a large file...
Let's try pickle instead...
In [87]:
#now pickle it instead
import cPickle as pickle
fn_pkl = './code/io/csv_example/big_array.pkl' # note you can have whatever extension you want here,
# or none at all, but I prefer a sensible extension
# pickle it
st = time.clock()
pickle.dump(Z,open(fn_pkl,'wb'))
print 'And this time... (s)', time.clock() - st
print 'File size (MB):', os.path.getsize(fn_pkl)/2**20
So the file sizes are the same, but the csv writer takes much longer. Now let's try reading them back in. Here's a fairly generic piece of code for reading csv files into data arrays:
In [88]:
def read_file_data(filename):
    with open(filename, 'U') as f:
        DataIn = csv.reader(f, delimiter=',')
        # use the first row to find the number of columns
        FirstLine = DataIn.next()
        NCols = len(FirstLine)
        DataOut = [[] for i in range(NCols)]
        def add_row(row):
            for i in range(NCols):
                try:
                    DataOut[i].append(float(row[i]))
                except ValueError:  # if not numeric
                    DataOut[i].append(0)
        add_row(FirstLine)  # the first row holds data too, so don't drop it
        for row in DataIn:
            add_row(row)
    return DataOut
In [89]:
#read in csv file
st = time.clock()
Z = read_file_data(fn_csv)
print 'Elapsed time (ms):', (time.clock() - st)*1e3
Now compare with reading in from the pickled file
In [90]:
st = time.clock()
Z = pickle.load(open(fn_pkl,'rb'))
print 'Elapsed time (ms):', (time.clock() - st)*1e3
This might not seem like much of a speed-up, but if the file sizes get larger then this becomes a sizeable increase in performance.
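For plain numpy arrays there is also numpy's own binary format, np.save/np.load, which is broadly comparable to pickle in speed and file size and doesn't need the pickle module at read time. A sketch, using a small stand-in array and a temporary file:

```python
import os
import tempfile
import numpy as np

Z = np.ones((200, 200))  # stand-in for the big array above

fn = os.path.join(tempfile.mkdtemp(), 'big_array.npy')
np.save(fn, Z)    # writes Z in numpy's binary .npy format
Z2 = np.load(fn)  # reads it straight back as an array

print(Z2.shape, (Z2 == Z).all())
```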
However, the pickle module is most useful when storing many data types, as there is no need to format the data before saving:
In [92]:
# pickle example for mixed data types
# numpy 1d array
x = np.arange(-100,100,0.01)
# numpy 2d array
y = np.ones((500,500))
# list of mixed type
z = [1,4,6,0,'abcde']
# string
a = 'this is a string'
# tuple
b = (42,'anything')
Now let's say we want to save all of this data. Instead of writing many files, it can all be bundled into one pickled file:
In [93]:
fn_pkl = './code/io/csv_example/multi_out.pkl'
#save
pickle.dump([x,y,z,a,b],open(fn_pkl,'wb'))
And then to read it back in:
In [97]:
x2,y2,z2,a2,b2 = pickle.load(open(fn_pkl,'rb'))
#check the data is the same as what we put in...
# (== on numpy arrays compares element-by-element, so use np.array_equal there)
print 'Data arrays same?\n', np.array_equal(x2,x), np.array_equal(y2,y), z2==z, a2==a, b2==b