Loading files

The goal of this lecture-lab is to learn how to extract data from files on your laptop's disk. We'll load words from a text file and numbers from data files. Along the way, we'll learn more about filenames and paths to files. The first two elements of our generic analytics program template say to acquire data and then load it into a data structure:

  1. Acquire data, which means finding a suitable file or collecting data from the web and storing in a file
  2. Load data from disk and place into memory organized into data structures

For now, we'll satisfy the first step by just downloading ready-made data files from the web by hand. In MSAN692 -- Data Acquisition, we'll learn all about how to pull data from the web programmatically. This lecture focuses on the second step in the analytics program template.

As we go along, I'm going to repeatedly ask you to type in a bunch of these examples. It's critical that you learn the code patterns associated with loading data from files. Please type in your code without cutting and pasting.

What are files?

As we've discussed before, both the disk and RAM are forms of memory. RAM is much faster (but smaller) than the disk, but everything in RAM disappears when the power goes out. Disks, on the other hand, are persistent. A file is simply a chunk of data on the disk identified by a filename. You use files all the time. For example, we can double-click on a text file or Excel file, which opens an application to display those files.

We need to be able to write Python programs that read data from files just like, say, Excel does. Accessing data in RAM is very easy in a Python program: we simply refer to the various elements in a list using an index, such as names[i]. File data is less convenient to access because we have to explicitly load the file into working memory first. For example, we might want to load a list of names from a file into a names list.

If a file is too big to fit into memory all at once, we have to process the data in chunks. For now, let's assume all files fit in memory.

Even so, accessing files is a bit of a hassle because we must explicitly tell Python to open a file and (often) then close it when we're done. We also must distinguish between reading and writing from/to a file, which dictates the mode in which we open the file. We can open the file in read, write, or append mode. For this lab, we will only concern ourselves with the default case of "opening a file for reading." Here is how to open a file called foo.txt in read mode (the default) then immediately close that file:

f = open('foo.txt')  # open for read mode
f.close()            # ok, we're done
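
We won't need write or append mode in this lab, but here is a minimal sketch of how the three modes differ. The file lives in a scratch directory made by tempfile, and the path and contents are made up for illustration:

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory; not part of the lecture data.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

f = open(path, 'w')      # write mode: creates the file, or empties an existing one
f.write('first line\n')
f.close()

f = open(path, 'a')      # append mode: keeps existing contents, adds to the end
f.write('second line\n')
f.close()

f = open(path, 'r')      # read mode, the same as open(path)
contents = f.read()
f.close()

print(contents)
```

The mode is the second argument to open(); leaving it off gives you read mode.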

Hmm...what kind of object is returned from open() and stored in f? Why do we have to close files?

File descriptors

When we open a file, Python gives us a "file object" that is really just a handle or descriptor that the operating system gives us. It's a unique identifier and how the operating system likes to identify a file that we work with. The file object is not the filename and is also not the file itself on the disk. It's really just a descriptor and a reference to the file.

We will use a filename to get a file object using open() and use the file object to get the file contents.


In [14]:
f = open("data/prices.txt") # or just "prices.txt"
print(type(f))
print(f)
f.close()
print(f.closed)


<class '_io.TextIOWrapper'>
<_io.TextIOWrapper name='data/prices.txt' mode='r' encoding='UTF-8'>
True

(Think of TextIOWrapper as file.)

The close operation informs the operating system that you no longer need that resource. The operating system can only open so many files at once so you should close files when you're done using them.

Later, when you are learning to write data to files, the close operation is also important. Closing a file flushes any data in memory buffers that needs to be written. From the Python documentation:

"It is a common bug to write a program where you have the code to add all the data you want to a file, but the program does not end up creating a file. Usually this means you forgot to close the file."

To help avoid confusion, keep this analogy in mind. The contents of your house (the file) are different from your address (the filename) and different from a piece of paper with the address written on it (the file descriptor). More specifically:

  1. The filename is a string that identifies a file on the disk. It can be fully qualified or relative to the current working directory.

  2. The file object is not the filename and is also not the file itself on the disk. It's really just a descriptor and a reference to the file.

  3. The contents of the file are different from the filename and from the file (descriptor) object that Python gives us.
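
To make the three ideas concrete, here is a small sketch that creates a throwaway file and then looks at the filename, the file object, and the contents separately. The scratch path and the two prices are assumptions made up for the example:

```python
import os
import tempfile

# The filename: just a string (the "address"). Hypothetical scratch location.
filename = os.path.join(tempfile.mkdtemp(), 'prices.txt')
with open(filename, 'w') as out:
    out.write('0.605\n0.600\n')

f = open(filename)    # the file object (the "piece of paper with the address")
contents = f.read()   # the contents (what is "in the house"), as a string
f.close()

print(type(filename).__name__)  # str
print(type(f).__name__)         # TextIOWrapper
print(repr(contents))           # the characters stored in the file
```

Notice that f still exists after close(); it just can't be used to read any more.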

Python WITH statement

More recent versions of Python provide an excellent mechanism to avoid forgetting the file close operation. The with statement is more general, but we will use it just for automatically closing files. Even if there is an exception inside the with statement that forces the program to terminate, the close operation occurs. Here is how to use it:


In [25]:
with open("data/prices.txt") as f:
    contents = f.read()
    print(type(contents))
print(contents[0:10])
print(f.closed)


<class 'str'>
0.605
0.60
True
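
To see that the close really happens when something goes wrong, here is a small sketch that forces an error inside the with block. The scratch file and its bad second line are made up for the demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'prices.txt')  # hypothetical scratch file
with open(path, 'w') as f:
    f.write('0.605\nnot-a-number\n')

try:
    with open(path) as f:
        for line in f:
            price = float(line)   # raises ValueError on the second line
except ValueError as e:
    print('conversion failed:', e)

print(f.closed)   # True: the with statement closed the file despite the error
```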

File names and paths

You know what a file name is because you've created lots of files before. (BTW, another reminder not to use spaces in your file or directory names.) Paths are unique specifiers or locators for directories or files. A fully-qualified filename gives a description of the directories from the root of the file system, separated by /. The root of the file system is identified with / (forward slash) at the start of a pathname. You are probably used to seeing it as "Macintosh HD" but from a programming point of view, it's just /. On Windows, which we will not consider here, the root includes the drive specification and a backslash like C:\. For example, the fully qualified pathname to a file called view.py might be /Users/parrt/classes/msan501/images-parrt/view.py: everything up to the last / names the directories leading to the file, and view.py is the filename.

As a shorthand, you can start a path with ~, which means "my home directory". On a Mac that's /Users/parrt or whatever your user ID is. On Linux, it's probably /home/parrt.

The last element in a path is either a filename or a directory. For example, to refer to the directory holding view.py, use path /Users/parrt/classes/msan501/images-parrt. Or, using the shortcut, the fully qualified path is ~/classes/msan501/images-parrt. Here's an example bash session that uses some fully qualified paths:

$ ls /Users/parrt/classes/msan501/images-parrt/view.py
/Users/parrt/classes/msan501/images-parrt/view.py
$ cd /Users/parrt/classes/msan501/images-parrt
$ pwd
/Users/parrt/classes/msan501/images-parrt
$ cd ~/classes/msan501/images-parrt
$ pwd
/Users/parrt/classes/msan501/images-parrt
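
The same pieces of a path can be pulled apart from Python. This is a sketch using the standard pathlib and os.path modules; PurePosixPath is used only so the example treats the string as a Mac/Linux-style path no matter where it runs:

```python
import os
from pathlib import PurePosixPath

p = PurePosixPath('/Users/parrt/classes/msan501/images-parrt/view.py')

print(p.is_absolute())  # True, because the path starts at the root /
print(p.name)           # view.py
print(p.parent)         # /Users/parrt/classes/msan501/images-parrt
print(p.parts[0])       # / (the root is the first component)

# The shell expands ~ for us, but open() does not; expand it explicitly.
home = os.path.expanduser('~')
print(home)             # your home directory, whatever it is on this machine
```

Keep the ~ point in mind: a string such as '~/data/prices.txt' has to go through os.path.expanduser() before you hand it to open().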

Current working directory

All programs run with the notion of a current working directory. So, if a program is running inside the directory ~/classes/msan501/images-parrt, then the program could refer to any data files sitting in that directory with just a file name--no path is required. For example, let's use the ls program to demonstrate the different kinds of paths.

$ cd ~/classes/msan501/images-parrt
$ ls
view.py
$ ls /Users/parrt/classes/msan501
images-parrt/
$ ls /Users/parrt/classes
msan501/

Any path that does not start with ~ or / is called a relative pathname. For completeness, note that .. means the directory above the current working directory:

$ cd ~/classes/msan501/images-parrt
$ ls ..
images-parrt/
$ ls ../..
msan501/
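
A Python program has a current working directory too, and relative filenames passed to open() are resolved against it. Here is a quick sketch using the os module; the filename prices.txt is only an example and does not need to exist:

```python
import os

cwd = os.getcwd()                      # like pwd in the shell
print(cwd)

full = os.path.abspath('prices.txt')   # relative name resolved against cwd
print(full)

parent = os.path.abspath('..')         # .. is the directory above cwd
print(parent)
```

If open('prices.txt') fails with a "no such file" error, printing os.getcwd() is usually the fastest way to find out where Python is actually looking.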

Sometimes you will see me use /tmp, which is a temporary directory or dumping ground. All files in that directory are usually erased when you reboot.

Loading text files

As we discussed early in the semester, files are just bits. It's how we interpret the bits that is meaningful. The bits could represent an image, a movie, an article, data, Python program text, whatever. Let's call any file containing characters a text file and anything else a binary file.

Text files are usually 1 byte per character (8 bits) and have the notion of a line. A line is just a sequence of characters terminated with either \r\n (Windows) or \n (UNIX, Mac). A text file is usually then a sequence of lines. Download this sample text file, IntroIstanbul.txt, so we have something to work with. You can save it in /tmp or whatever directory you are using for in-class work. For the purposes of this discussion, I have data files in a subdirectory called data of this notes directory.

The first 10 lines of the file look like:


In [16]:
! head -10 data/IntroIstanbul.txt


  
  
    
      
        The City and ITS People
        Istanbul is one of the worlds most venerable cities. Part
        of the citys allure is its setting, where Europe faces Asia acr­oss
        the winding turquoise waters of the Bosphorus, making it the only city
        in the world to bridge two continents.

You can ignore the "!" on the front as it is just telling this Jupyter notebook to run the terminal command that follows. If you want you can think of ! as the $ terminal prompt in this context.

Now, let's examine the contents of the file in a raw fashion rather than with a text editor. The od command (octal dump) is useful for looking at the bytes of the file. Use option -c to see the contents as 1-byte characters:


In [17]:
! od -c data/IntroIstanbul.txt | head -5


0000000   \n          \n          \n                  \n                
0000020           \n                                   T   h   e       C
0000040    i   t   y       a   n   d       I   T   S       P   e   o   p
0000060    l   e  \n                                   I   s   t   a   n
0000100    b   u   l       i   s       o   n   e       o   f       t   h

That "| head -5" pipes (the vertical bar "|" looks like a pipe) the output of the od command to the head program, which gives us the first five lines of the output. When we have a lot of output, we can also pipe it to the more program to paginate it.

$ od -c data/IntroIstanbul.txt | more
...

The \n character you see represents the single character we know as the newline (line feed). A carriage return is a different character, written \r, which Windows places before each \n. The numbers on the left are the character offsets into the file, shown in octal rather than decimal; use -A d to get decimal addresses.
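
We can do the same kind of raw inspection from Python by opening a file in binary mode ('rb'), which hands back the bytes untouched. This is only a sketch: the scratch file and its two lines are made up, and the loop is a rough imitation of od, not a replacement for it:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'sample.txt')  # hypothetical scratch file
with open(path, 'wb') as f:          # 'wb' writes raw bytes with no translation
    f.write(b'The City\nand ITS People\n')

with open(path, 'rb') as f:          # 'rb' reads raw bytes
    raw = f.read()

# Show 16 bytes per row with the offset in octal, roughly like od -c.
for offset in range(0, len(raw), 16):
    chunk = raw[offset:offset + 16]
    print(f'{offset:07o}', chunk)

print(raw.count(b'\n'), 'newline bytes,', raw.count(b'\r'), 'carriage returns')
```

Each line ends with exactly one \n byte here; a file saved on Windows would typically show \r\n instead.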

Let's look at some common programming patterns dealing with text files.

Pattern: Load all file contents into a string.

The sequence of operations is open, load, close.


In [18]:
with open('data/IntroIstanbul.txt') as f:
    contents = f.read() # read all content of the file
print(contents[0:200]) # print just the first 200 characters


  
  
    
      
        The City and ITS People
        Istanbul is one of the worlds most venerable cities. Part
        of the citys allure is its setting, where Europe faces Asia acr­oss
       

Exercise

Without cutting and pasting, type in that sequence and make sure you can print the contents of the file from Python. Instead of data, use whatever directory you saved that IntroIstanbul.txt in.

Pattern: Load all words of file into a list.

This pattern is just an extension of the previous where we split() on the space character to get a list:


In [19]:
with open('data/IntroIstanbul.txt') as f:
    contents = f.read() # read all content of the file
words = contents.split(' ')
print(words[0:100]) # print first 100 words


['\n', '', '\n', '', '\n', '', '', '', '\n', '', '', '', '', '', '\n', '', '', '', '', '', '', '', 'The', 'City', 'and', 'ITS', 'People\n', '', '', '', '', '', '', '', 'Istanbul', 'is', 'one', 'of', 'the', 'worlds', 'most', 'venerable', 'cities.', 'Part\n', '', '', '', '', '', '', '', 'of', 'the', 'citys', 'allure', 'is', 'its', 'setting,', 'where', 'Europe', 'faces', 'Asia', 'acr\xadoss\n', '', '', '', '', '', '', '', 'the', 'winding', 'turquoise', 'waters', 'of', 'the', 'Bosphorus,', 'making', 'it', 'the', 'only', 'city\n', '', '', '', '', '', '', '', 'in', 'the', 'world', 'to', 'bridge', 'two', 'continents.\n', '', '', '', '']

Because we are splitting on the space character, newlines and multiple space characters in a row yield "words" that are not useful. We need to transform that list into a new list before it is useful.
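
To see exactly where the junk "words" come from, here is a tiny made-up string with a newline and runs of spaces. It also shows that calling split() with no argument splits on any run of whitespace, which is a different route from the filter pattern used in the exercise that follows:

```python
text = '\n  The City\n  Istanbul is'   # made-up snippet, not the real file

on_space = text.split(' ')   # split at every single space character
print(on_space)              # ['\n', '', 'The', 'City\n', '', 'Istanbul', 'is']

on_whitespace = text.split() # split on runs of spaces, tabs, and newlines
print(on_whitespace)         # ['The', 'City', 'Istanbul', 'is']
```

Two spaces in a row produce an empty string between them, and a newline that is not next to a space stays glued to its word.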

Exercise

Using the filter programming pattern, filter words for only those words longer than 1 character; place the result into another list called words2. Hint: len(s) gets the length of string s. [solutions]

Exercise

Put all of this together by writing a function called getwords that takes filename as a parameter and returns the list of words greater than one character long. This is a combination of the "load all words of the file into a list" pattern and the previous exercise. [solutions]

Loading all lines of a file

Reading the contents of a file into a string is not always that useful. We typically want to deal with the words, as we just saw, or the lines of a text file. Natural language processing (NLP) would focus on using the words, but let's look at some data files, which typically structure files as lines of data. Each line represents an observation, data point, or record.

We could split the text contents by \n to get the lines, but it is so common that Python provides functions to do that for us. To give us some data to play with, download prices.txt that has a list of prices, one price per line. Here's another very common programming pattern:

Pattern: Read all of the lines of the file into a list.

The sequence is open, read lines, close:


In [27]:
with open('data/prices.txt') as f:
    prices = f.readlines() # get lines of file into a list
prices[0:10]


Out[27]:
['0.605\n',
 '0.600\n',
 '0.594\n',
 '0.592\n',
 '0.600\n',
 '0.616\n',
 '0.623\n',
 '0.628\n',
 '0.630\n',
 '0.629\n']
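
For comparison, here is a sketch of the different ways to get lines and what each one leaves on the end. It uses io.StringIO as an in-memory stand-in for an open file so that it runs without prices.txt; the three prices are copied from the output above:

```python
import io

f = io.StringIO('0.605\n0.600\n0.594\n')  # in-memory stand-in for an open file
contents = f.read()

by_split = contents.split('\n')         # ['0.605', '0.600', '0.594', '']
by_splitlines = contents.splitlines()   # ['0.605', '0.600', '0.594']

f.seek(0)                               # rewind so we can read again
by_readlines = f.readlines()            # ['0.605\n', '0.600\n', '0.594\n']

print(by_split)
print(by_splitlines)
print(by_readlines)
```

Splitting on '\n' by hand leaves an empty string after the final newline, whereas readlines() keeps the '\n' on every line.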

Exercise

Without cutting and pasting, type in that code and make sure you can read the lines of the file into a list.

Exercise

Use the strip() function on each element of the list so that you get: ['0.605', '0.600', '0.594', ...]. Use a list comprehension to map the prices to a new version of the prices list. [solutions]

Converting list of strings to numpy array

The numbers have the \n character on the end but that's not a problem because we can easily convert that using NumPy:


In [23]:
import numpy as np
prices2 = np.array(prices, dtype=float) # convert to array of numbers
print(type(prices2))
print(prices2[0:10])

from lolviz import *
objviz(prices2)


<class 'numpy.ndarray'>
[0.605 0.6   0.594 0.592 0.6   0.616 0.623 0.628 0.63  0.629]
Out[23]:
[lolviz diagram of the array: 0.605 0.6 0.594 0.592 0.6 ... 2.009 2.041 1.939 1.898 1.891]
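
Python's built-in float() is just as forgiving about the trailing \n, so the same conversion can be sketched without NumPy. The three strings are copied from the readlines() output; the blank-line case is an extra assumption about what a messy file might contain:

```python
lines = ['0.605\n', '0.600\n', '0.594\n']   # as returned by readlines()

values = [float(s) for s in lines]          # float() ignores surrounding whitespace
print(values)                               # [0.605, 0.6, 0.594]

# A line with no number in it is a different story.
try:
    float('\n')
except ValueError:
    blank_ok = False
    print('a blank line cannot be converted to a number')
```

So the \n itself is harmless, but an empty line in the file would stop the conversion with a ValueError.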

Exercise

Add this conversion to the previous exercise and make sure you get an array as output. (I'm trying to give you repeated experience typing code that reads data from a file and processes it in some way.) [solutions]

Loading CSV files

Let's look at a more complicated data file. Download heights.csv (stored here as data/player-heights.csv), which starts out like this:


In [19]:
! head -4 data/player-heights.csv


Football height, Basketball height
6.329999924, 6.079999924
6.5, 6.579999924
6.5, 6.25

It is still a text file, but now we start to get the idea that text files can follow a particular format. In this case, we recognize it as a comma-separated values (CSV) file. It also has a header line that names the columns, which means we need to treat the first line differently than the remainder of the file.

Pattern: Load a CSV file into a list of lists

We already know how to open a file and get the lines, so let's do that and also separate the lines into the header and the data components:


In [2]:
import numpy as np

with open('data/player-heights.csv') as f:
    lines = f.readlines()

lines = [line.strip() for line in lines] # remove \n on end
lines[0:5]


Out[2]:
['Football height, Basketball height',
 '6.329999924, 6.079999924',
 '6.5, 6.579999924',
 '6.5, 6.25',
 '6.25, 6.579999924']

In [3]:
header = lines[0]
data = lines[1:] # slice

# print it back out

print(header)
for d in data[0:5]:
    print(d)


Football height, Basketball height
6.329999924, 6.079999924
6.5, 6.579999924
6.5, 6.25
6.25, 6.579999924
6.5, 6.25

Exercise

As an exercise, type in that code and make sure you get the header and the first five lines of data printed out.

Exercise

Each row of the data is a string with two numbers in it. We need to convert that string into a list with two elements (the numbers, still as strings) using split(','). Combining all of those two-element lists into an overall list gives us the two-dimensional table we need, which is our next exercise.

Write a function called getcsv(filename) that returns a list of row lists, where the first row is the header row. Strip off any \n char on the end of lines. After the header row, the output should look like:

[['6.329999924', ' 6.079999924'], ['6.5', ' 6.579999924'], ['6.5', ' 6.25']]

Use list comprehensions where you can. [solutions]
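
Once you have done it by hand, it is worth knowing that Python's standard csv module does the same splitting. This is a sketch of that alternative, not the exercise solution: it reads from an in-memory io.StringIO copy of the first few lines instead of the real file, and it uses skipinitialspace=True to drop the space after each comma:

```python
import csv
import io

text = ('Football height, Basketball height\n'
        '6.329999924, 6.079999924\n'
        '6.5, 6.579999924\n')
f = io.StringIO(text)   # stand-in for open('data/player-heights.csv')

rows = list(csv.reader(f, skipinitialspace=True))  # list of row lists
header = rows[0]
data = [[float(v) for v in row] for row in rows[1:]]  # strings to numbers

print(header)   # ['Football height', 'Basketball height']
print(data)     # [[6.329999924, 6.079999924], [6.5, 6.579999924]]
```

Compare that with the split(',') output above, where the second value in each row keeps its leading space.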

Exercise

import pandas as pd and then convert the data from the previous exercise into a data frame. Pandas doesn't automatically understand that the first row is the header so slice out data[1:] as the first argument to the pd.DataFrame() data frame constructor and then pass data[0] as the columns parameter. Print it out and you should see something like:

   Football height  Basketball height
0      6.329999924        6.079999924
1              6.5        6.579999924
2              6.5               6.25
...

[solutions]

Using Pandas to load CSV files

Of course, loading CSV is something that data scientists need to do all of the time and so there is a simple function you can use from Pandas, another library you will probably become very familiar with:


In [12]:
import pandas as pd
prices = pd.read_csv('data/prices.txt', header=None)
prices.head(5)


Out[12]:
       0
0  0.605
1  0.600
2  0.594
3  0.592
4  0.600

(header=None indicates that there are no column names in the first line of the file.)

This even works for CSV files with header rows:


In [13]:
data = pd.read_csv('data/player-heights.csv')
data.head(5)


Out[13]:
   Football height  Basketball height
0             6.33               6.08
1             6.50               6.58
2             6.50               6.25
3             6.25               6.58
4             6.50               6.25

We'll see this stuff again in data frames.

Processing files line by line

The previous mechanism for getting lines of text into memory works well except that it requires we load everything into memory all at once. That is pretty inefficient and limits the size of the data we can process to the amount of memory we have.

Pattern: Read data line by line not all at once.

We can use a for-each loop where the sequence of data is the file descriptor:

with open('data/prices.txt') as f:
    for line in f: # for each line in the file
        print(float(line)) # process the line in some way

In [18]:
n = 5
with open('data/prices.txt') as f:
    for line in f: # for each line in the file
        if n>0:
            print(float(line)) # process the line in some way
        n -= 1


0.605
0.6
0.594
0.592
0.6
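
The counter version above still walks through every line of the file even after it stops printing. One way to stop early is itertools.islice from the standard library, sketched here on an in-memory stand-in (io.StringIO) holding the first seven prices rather than the real file:

```python
import io
from itertools import islice

# In-memory stand-in for open('data/prices.txt'); values copied from the output above.
f = io.StringIO('0.605\n0.600\n0.594\n0.592\n0.600\n0.616\n0.623\n')

first = [float(line) for line in islice(f, 5)]  # reads only the first 5 lines
print(first)                                    # [0.605, 0.6, 0.594, 0.592, 0.6]

remaining = sum(1 for line in f)                # the rest are still unread
print(remaining)                                # 2
```

A plain break inside the for loop once n reaches 0 would work as well; either way the point is that we never hold more than one line in memory at a time.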

Exercise

Type in this new version of processing the lines of the file. No cutting and pasting!

Exercise

Create a function called getsum that takes a filename string as a parameter and returns the sum of values from the lines in that file. Do not use the readlines() function. Use a for line in f type loop inside a list comprehension to manually get the list of lines. Call getsum with filename data/prices.txt and print out the sum of prices. Use sum(...) to total the list created by your list comprehension.

Operations after closing the file

Keep in mind that once you close the file, you cannot read any more data from it:

f = open('data/prices.txt')
f.close()
f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file
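
Here is a sketch of the same mistake in a script, along with the usual fix: do all of the reading while the file is still open and keep the resulting string, which outlives the file object. The scratch file and its two prices are made up so the example runs on its own:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'prices.txt')  # hypothetical scratch file
with open(path, 'w') as f:
    f.write('0.605\n0.600\n')

f = open(path)
f.close()
try:
    f.read()                       # too late, the file is already closed
except ValueError as e:
    message = str(e)
    print('cannot read:', message)

with open(path) as f:
    contents = f.read()            # read while the file is still open
print(contents[0:5])               # 0.605 (the string is still usable)
```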

Summary

The key programming patterns to take away from this lecture are:

  • Pattern: Load all file contents into a string.
  • Pattern: Load all words of file into a list.
  • Pattern: Read all of the lines of the file into a list.
  • Pattern: Load list of numbers into a numpy array.
  • Pattern: Load a CSV file into a list of lists.
  • Pattern: Read data line by line, not all at once.

You should be able to code those patterns quickly and easily, and without cutting and pasting from Stack Overflow.