The goal of this lecture-lab is to learn how to extract data from files on your laptop's disk. We'll load words from a text file and numbers from data files. Along the way, we'll learn more about filenames and paths to files. The first two elements of our generic analytics program template say to acquire data and then load it into a data structure:
For now, we'll satisfy the first step by just downloading ready-made data files from the web by hand. In MSAN692 -- Data Acquisition, we'll learn all about how to pull data from the web programmatically. This lecture focuses on the second step in the analytics program template.
As we go along, I'm going to repeatedly ask you to type in a bunch of these examples. It's critical that you learn the code patterns associated with loading data from files. Please type in your code without cutting and pasting.
As we've discussed before, both the disk and RAM are forms of memory. RAM is much faster (but smaller) than the disk but RAM all disappears when the power goes out. Disks, on the other hand, are persistent. A file is simply a chunk of data on the disk identified by a filename. You use files all the time. For example, we can double-click on a text file or Excel file, which opens an application to display those files.
We need to be able to write Python programs that read data from files just like, say, Excel does. Accessing data in RAM is very easy in a Python program: we simply refer to the various elements in a list using an index, such as names[i]. File data is less convenient to access because we have to explicitly load the file into working memory first. For example, we might want to load a list of names from a file into a names list.
If a file is too big to fit into memory all at once, we have to process the data in chunks. For now, let's assume all files fit in memory.
Even so, accessing files is a bit of a hassle because we must explicitly tell Python to open a file and (often) then close it when we're done. We also must distinguish between reading and writing from/to a file, which dictates the mode in which we open the file. We can open the file in read, write, or append mode. For this lab, we will only concern ourselves with the default case of "opening a file for reading." Here is how to open a file called foo.txt in read mode (the default) and then immediately close that file:
f = open('foo.txt') # open for read mode
f.close() # ok, we're done
Hmm...what kind of object is returned from open() and stored in f? Why do we have to close files?
When we open a file, Python gives us a "file object," which is really just a handle or descriptor that the operating system gives us. It's a unique identifier that the operating system uses to track the files we work with. The file object is not the filename and is also not the file itself on the disk. It's really just a descriptor and a reference to the file.
We will use a filename to get a file object using open() and use the file object to get the file contents.
In [14]:
f = open("data/prices.txt") # or just "prices.txt"
print(type(f))
print(f)
f.close()
print(f.closed)
(Think of TextIOWrapper as a file object.)
The close operation informs the operating system that you no longer need that resource. The operating system can only open so many files at once so you should close files when you're done using them.
Later, when you are learning to write data to files, the close operation is also important. Closing a file flushes any data in memory buffers that needs to be written. From the Python documentation:
"It is a common bug to write a program where you have the code to add all the data you want to a file, but the program does not end up creating a file. Usually this means you forgot to close the file."
The filename is a string that identifies a file on the disk. It can be fully qualified or relative to the current working directory.
The file object is not the filename and is also not the file itself on the disk. It's really just a descriptor and a reference to the file.
The contents of the file are different from the filename and the file (descriptor) object that Python gives us.
More recent versions of Python provide an excellent mechanism to avoid forgetting the file close operation. The with statement is more general, but we will use it just for automatically closing files. Even if an exception inside the with statement forces the program to terminate, the close operation occurs. Here is how:
In [25]:
with open("data/prices.txt") as f:
    contents = f.read()
print(type(contents))
print(contents[0:10])
print(f.closed)
You know what a file name is because you've created lots of files before. (BTW, another reminder not to use spaces in your file or directory names.) Paths are unique specifiers or locators for directories or files. A fully-qualified filename gives the sequence of directories from the root of the file system, separated by /. The root of the file system is identified with / (forward slash) at the start of a pathname. You are probably used to seeing it as "Macintosh HD," but from a programming point of view, it's just /. On Windows, which we will not consider here, the root includes the drive specification and a backslash, like C:\. Here's a useful diagram showing the components of a fully qualified pathname to a file called view.py:
As a shorthand, you can start a path with ~, which means "my home directory." On a Mac that's /Users/parrt or whatever your user ID is. On Linux, it's probably /home/parrt.
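One caveat worth knowing: the ~ shorthand is expanded by the shell, not by Python's open(). If you want to use it inside a Python program, expand it first with the standard os.path.expanduser function; the path below is just the example from the diagram:

```python
import os

# expanduser replaces a leading ~ with your actual home directory
path = os.path.expanduser('~/classes/msan501/images-parrt')
print(path)   # e.g., /Users/parrt/classes/msan501/images-parrt on a Mac
```

Passing the raw '~/...' string straight to open() would fail because no directory is literally named ~.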
The last element in a path is either a filename or a directory. For example, to refer to the directory holding view.py in the above diagram, use path /Users/parrt/classes/msan501/images-parrt. Or, using the shortcut, the fully qualified path is ~/classes/msan501/images-parrt. Here's an example bash session that uses some fully qualified paths:
$ ls /Users/parrt/classes/msan501/images-parrt/view.py
/Users/parrt/classes/msan501/images-parrt/view.py
$ cd /Users/parrt/classes/msan501/images-parrt
$ pwd
/Users/parrt/classes/msan501/images-parrt
$ cd ~/classes/msan501/images-parrt
$ pwd
/Users/parrt/classes/msan501/images-parrt
All programs run with the notion of a current working directory. So, if a program is running inside the directory ~/classes/msan501/images-parrt, then the program could refer to any data files sitting in that directory with just a file name--no path is required. For example, let's use the ls program to demonstrate the different kinds of paths.
$ cd ~/classes/msan501/images-parrt
$ ls
view.py
$ ls /Users/parrt/classes/msan501
images-parrt/
$ ls /Users/parrt/classes
msan501/
Any path that does not start with ~ or / is called a relative pathname. For completeness, note that .. means the directory above the current working directory:
$ cd ~/classes/msan501/images-parrt
$ ls ..
images-parrt/
$ ls ../..
msan501/
Sometimes you will see me use /tmp, which is a temporary directory or dumping ground. All files in that directory are usually erased when you reboot.
As we discussed early in the semester, files are just bits. It's how we interpret the bits that is meaningful. The bits could represent an image, a movie, an article, data, Python program text, whatever. Let's call any file containing characters a text file and anything else a binary file.
Text files are usually 1 byte per character (8 bits) and have the notion of a line. A line is just a sequence of characters terminated with either \r\n (Windows) or \n (UNIX, Mac). A text file is usually then a sequence of lines. Download this sample text file, IntroIstanbul.txt, so we have something to work with. You can save it in /tmp or whatever directory you are using for in-class work. For the purposes of this discussion, I have data files in a subdirectory called data of this notes directory.
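Because the line terminator differs across operating systems, it's handy to know that Python's splitlines() string method recognizes both conventions; a quick sketch with a made-up string:

```python
text = "first line\r\nsecond line\nthird line\n"
print(text.splitlines())   # ['first line', 'second line', 'third line']
```

Either way the terminators are stripped, so you get just the lines themselves.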
The first 10 lines of the file look like:
In [16]:
! head -10 data/IntroIstanbul.txt
You can ignore the "!" on the front, as it just tells this Jupyter notebook to run the terminal command that follows. If you want, you can think of ! as the $ terminal prompt in this context.
Now, let's examine the contents of the file in a raw fashion rather than with a text editor. The od command (octal dump) is useful for looking at the bytes of the file. Use option -c to see the contents as 1-byte characters:
In [17]:
! od -c data/IntroIstanbul.txt | head -5
That "| head -5" pipes (the vertical bar "|" looks like a pipe) the output of the od command to the head program, which gives just the first five lines of the output. When we have a lot of output, we can also pipe the output to the more program to paginate long output.
$ od -c data/IntroIstanbul.txt | more
...
The \n character you see represents the single character we know as newline (\r is the carriage return). The numbers on the left are the character offsets into the file (they are octal, not decimal, btw; use -A d to get decimal addresses).
Let's look at some common programming patterns dealing with text files.
In [18]:
with open('data/IntroIstanbul.txt') as f:
    contents = f.read()    # read all content of the file
print(contents[0:200])     # print just the first 200 characters
In [19]:
with open('data/IntroIstanbul.txt') as f:
    contents = f.read()    # read all content of the file
words = contents.split(' ')
print(words[0:100])        # print first 100 words
Because we are splitting on the space character, newlines and multiple space characters in a row yield "words" that are not useful. We need to transform that list into a new list before it is useful.
Using the filter programming pattern, filter words for only those words greater than 1 character long; place them into another list called words2. Hint: len(s) gets the length of string s. [solutions]
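Try the exercise yourself before peeking; one possible sketch of the filter pattern looks like this, with a small hand-made stand-in for the real words list:

```python
words = ['the', '', 'a', 'Istanbul', '\n', 'of', 'city']   # stand-in for the real list
words2 = [w for w in words if len(w) > 1]   # keep only words longer than 1 character
print(words2)   # ['the', 'Istanbul', 'of', 'city']
```

The condition after if decides which elements survive into the new list; everything of length 0 or 1 (including empty strings and stray newlines) is dropped.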
Put all of this together by writing a function called getwords that takes filename as a parameter and returns the list of words greater than one character long. This is a combination of the "load all words of the file into a list" pattern and the previous exercise. [solutions]
Reading the contents of a file into a string is not always that useful. We typically want to deal with the words, as we just saw, or the lines of a text file. Natural language processing (NLP) would focus on using the words, but let's look at some data files, which typically structure files as lines of data. Each line represents an observation, data point, or record.
We could split the text contents by \n to get the lines, but it is so common that Python provides functions to do that for us. To give us some data to play with, download prices.txt, which has a list of prices, one price per line. Here's another very common programming pattern:
In [27]:
with open('data/prices.txt') as f:
    prices = f.readlines()    # get lines of file into a list
prices[0:10]
Out[27]:
Use the strip() function on each element of the list so that you get: ['0.605', '0.600', '0.594', ...]. Use a list comprehension to map the prices to a new version of the prices list. [solutions]
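A sketch of the strip-and-map pattern, using a small hand-made stand-in for the list that readlines() returns:

```python
lines = ['0.605\n', '0.600\n', '0.594\n']   # stand-in for f.readlines() output
prices = [line.strip() for line in lines]   # remove the trailing \n from each
print(prices)   # ['0.605', '0.600', '0.594']
```

This is a map: the comprehension applies strip() to every element and collects the results into a new list.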
The numbers have the \n character on the end, but that's not a problem because we can easily convert them using NumPy:
In [23]:
import numpy as np
prices2 = np.array(prices, dtype=float) # convert to array of numbers
print(type(prices2))
print(prices2[0:10])
from lolviz import *
objviz(prices2)
Out[23]:
Add this conversion to the previous exercise and make sure you get an array as output. (I'm trying to give you repeated experience typing code that reads data from a file and processes it in some way.) [solutions]
Let's look at a more complicated data file. Download heights.csv, which starts out like this:
In [19]:
! head -4 data/player-heights.csv
It is still a text file, but now we start to get the idea that text files can follow a particular format. In this case, we recognize it as a comma-separated value (CSV) file. It also has a header line that names the columns, which means we need to treat the first line differently than the remainder of the file.
In [2]:
import numpy as np
with open('data/player-heights.csv') as f:
    lines = f.readlines()
lines = [line.strip() for line in lines] # remove \n on end
lines[0:5]
Out[2]:
In [3]:
header = lines[0]
data = lines[1:] # slice
# print it back out
print(header)
for d in data[0:5]:
    print(d)
Each row of the data is a string with two numbers in it. We need to convert that string into a list with two floating-point numbers using split(','). Combining all of those two-element lists into an overall list gives us the two-dimensional table we need, which is our next exercise.
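The conversion of a single row can be sketched like this; the row string is a made-up example in the file's format:

```python
row = '6.329999924, 6.079999924'              # one line from the CSV (minus the \n)
values = [float(x) for x in row.split(',')]   # split on commas, convert to floats
print(values)   # [6.329999924, 6.079999924]
```

Note that float() happily ignores the leading space left over after the split.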
Write a function called getcsv(filename) that returns a list of row lists, where the first row is the header row. Strip off any \n char on the end of lines. The output should look like:
[['6.329999924', ' 6.079999924'], ['6.5', ' 6.579999924'], ['6.5', ' 6.25']]
Use list comprehensions where you can. [solutions]
import pandas as pd and then convert the data from the previous exercise into a data frame. Pandas doesn't automatically understand that the first row is the header, so slice out data[1:] as the first argument to the pd.DataFrame() data frame constructor and then pass data[0] as the columns parameter. Print it out and you should see something like:
Football height Basketball height
0 6.329999924 6.079999924
1 6.5 6.579999924
2 6.5 6.25
...
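The construction described above can be sketched as follows; data here is a small hand-made stand-in for the result of the previous exercise:

```python
import pandas as pd

data = [['Football height', ' Basketball height'],   # header row
        ['6.329999924', ' 6.079999924'],
        ['6.5', ' 6.579999924']]
# rows after the first become the data; the first row names the columns
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
```

Slicing the header off yourself like this is exactly the kind of chore that read_csv, shown next, does for you.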
Of course, loading CSV is something that data scientists need to do all of the time and so there is a simple function you can use from Pandas, another library you will probably become very familiar with:
In [12]:
import pandas as pd
prices = pd.read_csv('data/prices.txt', header=None)
prices.head(5)
Out[12]:
(header=None indicates that there are no column names in the first line of the file.)
This even works for CSV files with header rows:
In [13]:
data = pd.read_csv('data/player-heights.csv')
data.head(5)
Out[13]:
We'll see this stuff again in data frames.
The previous mechanism for getting lines of text into memory works well except that it requires we load everything into memory all at once. That is pretty inefficient and limits the size of the data we can process to the amount of memory we have.
In [18]:
n = 5
with open('data/prices.txt') as f:
    for line in f:                 # for each line in the file
        if n > 0:
            print(float(line))     # process the line in some way
            n -= 1
Create a function called getsum that takes a filename string as a parameter and returns the sum of values from the lines in that file. Do not use the readlines() function. Use a for line in f type loop inside a list comprehension to manually get the list of lines. Call getsum with filename data/prices.txt and print out the sum of prices. Use sum(...) to total the list created by your list comprehension.
The key programming patterns to take away from this lecture are:
You should be able to code those patterns quickly and easily, and without cutting and pasting from stackoverflow.