COP3990C - Python Programming


In [2]:
# this block is just for the style sheet for the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[2]:

In [4]:
# file separator - your program needs to be multiplatform
import os

file_name = 'data' + os.sep + 'test'
print file_name
fin = open(file_name,'rb')
lines = fin.readlines()
fin.close()


data\test

In [5]:
print lines


[' Volume in drive C has no label.\r\n', ' Volume Serial Number is 92E0-AA74\r\n', '\r\n', ' Directory of C:\\Users\\User\\Dropbox\\UWF\\spring2014\\Python\\UWF_2014_spring_COP3990C-2507\\notebooks\\data\r\n', '\r\n', '01/29/2014  03:02 PM    <DIR>          .\r\n', '01/29/2014  03:02 PM    <DIR>          ..\r\n', '01/27/2014  05:07 PM                29 csv_file.csv\r\n', '01/27/2014  05:13 PM               565 csv_file2.csv\r\n', '01/26/2014  10:06 AM               567 csv_file3.csv\r\n', '01/27/2014  05:13 PM                33 csv_out.csv\r\n', '01/27/2014  05:18 PM               115 csv_output_2.csv\r\n', '01/27/2014  05:04 PM               105 out_file.txt\r\n', '01/29/2014  03:02 PM                 0 test\r\n', '01/27/2014  04:31 PM               807 test-linux\r\n', '01/27/2014  04:22 PM               140 test.py\r\n', '01/26/2014  09:18 AM                45 text_file2.txt\r\n', '              10 File(s)          2,406 bytes\r\n', '               2 Dir(s)  312,003,645,440 bytes free\r\n']

In [6]:
#file_name = 'data' + os.sep + 'test'
fin = open(r'data\test','r')
lines = fin.readlines()
fin.close()

In [7]:
lines


Out[7]:
[' Volume in drive C has no label.\n',
 ' Volume Serial Number is 92E0-AA74\n',
 '\n',
 ' Directory of C:\\Users\\User\\Dropbox\\UWF\\spring2014\\Python\\UWF_2014_spring_COP3990C-2507\\notebooks\\data\n',
 '\n',
 '01/29/2014  03:02 PM    <DIR>          .\n',
 '01/29/2014  03:02 PM    <DIR>          ..\n',
 '01/27/2014  05:07 PM                29 csv_file.csv\n',
 '01/27/2014  05:13 PM               565 csv_file2.csv\n',
 '01/26/2014  10:06 AM               567 csv_file3.csv\n',
 '01/27/2014  05:13 PM                33 csv_out.csv\n',
 '01/27/2014  05:18 PM               115 csv_output_2.csv\n',
 '01/27/2014  05:04 PM               105 out_file.txt\n',
 '01/29/2014  03:02 PM                 0 test\n',
 '01/27/2014  04:31 PM               807 test-linux\n',
 '01/27/2014  04:22 PM               140 test.py\n',
 '01/26/2014  09:18 AM                45 text_file2.txt\n',
 '              10 File(s)          2,406 bytes\n',
 '               2 Dir(s)  312,003,645,440 bytes free\n']

In [8]:
len(lines[0])


Out[8]:
33
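
The two listings above differ only in the mode passed to open(): 'rb' returns the bytes exactly as stored, while 'r' translates the Windows line ending '\r\n' into '\n'. The following is a minimal sketch of that difference, assuming Python 3 (the cells in this notebook are Python 2); the sample bytes and the temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample: two lines with Windows-style (\r\n) line endings.
raw = b"line one\r\nline two\r\n"

# Write the bytes to a temporary file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fout:
    fout.write(raw)

# Binary mode returns the bytes exactly as stored, including '\r'.
with open(path, "rb") as fin:
    binary_lines = fin.readlines()

# Text mode with default newline handling translates '\r\n' to '\n'.
with open(path, "r") as fin:
    text_lines = fin.readlines()

os.remove(path)

print(binary_lines)
print(text_lines)
```

Each line read in text mode is one character shorter than its binary counterpart, because the '\r' has been dropped.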

Types of files we'll look at

In this course we'll look at the following file types:

  • plain text
  • comma separated values (CSV)
  • binary
  • pickled
  • pandas DataFrames, HDF5, MATLAB .mat files, etc.

Later we'll use numpy functions to read text/csv/binary files directly into arrays and data structures.

    Plain Text Files

    Data can be stored in plain text files with arbitrary delimiters (space, tabs, commas). Data does not need to be organized into an equal number of elements per "record" or line. Plain text files (also CSV files) are structured as a set of lines terminated by end of line characters, '\n'. We don't need to import any libraries to start working with files. The following functions are used frequently:

    • open(). Creates a file object (handle) that makes the file's contents accessible to the program. A file can be opened in different modes. The default mode is "read" ('r'). Other modes are "write" ('w'), "append" ('a'), "read binary" ('rb'), "write binary" ('wb'), and "read and write" ('r+').
    • close(). Closes access to the file. If you forget to call it, Python will normally close the stream when the file object is garbage collected. You are nevertheless strongly encouraged to close the stream yourself as soon as you have read the data you need.
    • read(). Reads the contents of the file as one (possibly huge) string.
    • read(n). Reads at most $n$ bytes from the file, starting at the current position.
    • readline(). Reads the file one line at a time, returning each line as a string. Remember that lines are terminated by the end-of-line character '\n'.
    • readlines(). Reads the whole file and returns a list with one string per line.
    • write(). Writes one string at a time to the file.
    • seek(offset, reference point). Moves the file pointer by the specified offset from the reference point. The reference point can take one of 3 values: 0 - the beginning of the file, 1 - the current location, 2 - the end of the file.
    • tell(). Returns the current location of the pointer in the file.
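
The functions listed above can be exercised together in a few lines. This is only a rough sketch, assuming Python 3 and a scratch file in a temporary directory; the file name and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file; name and contents are made up for this sketch.
path = os.path.join(tempfile.mkdtemp(), "sketch.txt")

# write(): mode 'w' creates the file (or truncates an existing one).
# newline="\n" keeps the line endings identical on every platform.
with open(path, "w", newline="\n") as fout:
    fout.write("first line\n")
    fout.write("second line\n")

# Open in binary mode so that offsets are plain byte counts.
fin = open(path, "rb")
head = fin.read(5)              # read(n): the first 5 bytes
pos_after_head = fin.tell()     # tell(): current offset in the file
rest_of_line = fin.readline()   # readline(): up to and including '\n'
fin.seek(0, 0)                  # seek(): back to the beginning
all_lines = fin.readlines()     # readlines(): a list with every line
fin.seek(-5, 2)                 # 5 bytes before the end of the file
tail = fin.read()               # read(): everything up to the end
fin.close()                     # close(): release the file

print(head, pos_after_head, rest_of_line, all_lines, tail)
```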

    Examples

    
    
    In [9]:
    # open an existing  file for reading
    
    import os # needed for the file separator
    
    file_name = 'data' + os.sep + 'text_file2.txt'
    print file_name
    my_file = open(file_name, 'rb')
    
    # read the content of the whole file then close
    file_content = my_file.read()
    my_file.close()
    
    # print the content of the file
    print file_content
    
    
    
    
    data\text_file2.txt
    123456789
    123456789!
    123456789*
    123456789-
    
    
    
    In [10]:
    my_file = open(file_name, 'rb')
    
    
    
    In [11]:
    # read 10 bytes
    characters = my_file.read(10)
    print 'the first 10 bytes:', characters
    
    
    
    
    the first 10 bytes: 123456789
    
    
    
    In [12]:
    # read the next 10 bytes - recall that the end of line character is counted as 
    # part of the string
    characters = my_file.read(10)
    print 'the second 10 bytes', characters
    ord_list = [ord(x) for x in characters]
    print 'length of list is: ', len(ord_list)
    print 'list is ', ord_list
    
    
    
    
    the second 10 bytes 
    123456789
    length of list is:  10
    list is  [10, 49, 50, 51, 52, 53, 54, 55, 56, 57]
    
    
    
    In [13]:
    characters = my_file.read(3)
    print [ord(x) for x in characters]
    
    
    
    
    [33, 13, 10]
    
    
    
    In [14]:
    # the pointer's location after reading 23 bytes (10 + 10 + 3)
    print 'the pointer is at position ', my_file.tell()
    
    
    
    
    the pointer is at position  23
    
    
    
    In [15]:
    # read the next byte
    print 'the 24th byte is ', my_file.read(1)
    
    
    
    
    the 24th byte is  1
    
    
    
    In [16]:
    # move to the beginning of the file by using the seek function
    my_file.seek(0,0)
    
    
    
    In [17]:
    # skip 20 bytes (characters) and read the one right after
    print 'after seeking to the beginning of the file we are at location:', my_file.tell()
    my_file.seek(20, 1)
    print 'the 21st byte is:', my_file.read(1)
    
    
    
    
    after seeking to the beginning of the file we are at location: 0
    the 21st byte is: !
    
    
    
    In [18]:
    # skip the first 3 bytes from the beginning of the file
    my_file.seek(3, 0)
    print 'i am @ ', my_file.tell()
    
    
    
    
    i am @  3
    
    
    
    In [19]:
    # read the 4th byte
    print 'the 4th byte is ', my_file.read(1)
    print 'we are @ : ', my_file.tell()
    
    
    
    
    the 4th byte is  4
    we are @ :  4
    
    
    
    In [20]:
    # skip 10 bytes from this location
    my_file.seek(10, 1)
    
    
    
    In [21]:
    # get the current location and the character there
    print 'we are at ', my_file.tell()
    print my_file.read(1)
    
    
    
    
    we are at  14
    4
    
    
    
    In [22]:
    # read the rest of the file from this point on and print it
    print 'the rest of the file is:\n', my_file.read()
    
    
    
    
    the rest of the file is:
    56789!
    123456789*
    123456789-
    
    
    
    In [23]:
    # read the last 3 bytes before the end of the file
    my_file.seek(-3, 2)
    print my_file.read()
    
    
    
    
    89-
    
    
    
    In [24]:
    # try to read now that we've reached the end of the file - read() returns an empty string
    print 'the character at the end of the file is ', my_file.read()
    
    
    
    
    the character at the end of the file is  
    
    
    
    In [25]:
    # close the file
    my_file.close()
    
    
    
    In [26]:
    # now let's read one line at a time
    
    # open an existing  file for reading
    file_name = 'data' + os.sep + 'text_file2.txt'
    my_file = open(file_name, 'rb')
    
    # read the first line
    first_line = my_file.readline()
    print first_line
    
    # read the second line
    second_line = my_file.readline()
    print second_line
    
    # close file
    my_file.close()
    
    
    
    
    123456789
    
    123456789!
    
    
    
    
    In [27]:
    # note: when readline() is called, a line is read and the "pointer" is 
    # moved to the next line within the file; reading is done sequentially. 
    # So if you want the third line you have to read the first two and ignore them.
    
    # open an existing  file for reading
    file_name = 'data' + os.sep + 'text_file2.txt'
    my_file = open(file_name, 'r')
    
    # read the first and second lines and ignore them - this can go in a loop if you like
    line = my_file.readline()
    line = my_file.readline()
    
    # read the third line
    line = my_file.readline()
    print line
    
    # close the file
    my_file.close()
    
    
    
    
    123456789*
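
The comment in the cell above notes that the skipping can go in a loop. One possible way to do that is sketched here, assuming Python 3; an in-memory io.StringIO object stands in for the open file, and the names fake_file and wanted are made up for this illustration.

```python
import io

# Stand-in for an open text file; the contents mirror the sample file above.
fake_file = io.StringIO("123456789\n123456789!\n123456789*\n123456789-")

wanted = 3  # 1-based number of the line we are after (illustrative choice)

third_line = None
# Iterating over a file object yields one line per step, just like
# repeated calls to readline().
for line_number, line in enumerate(fake_file, start=1):
    if line_number == wanted:
        third_line = line
        break  # stop reading once we have the line we need

print(third_line.strip())
```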
    
    
    
    
    In [28]:
    # read the whole file as a list containing each line as an element
    
    
    # open an existing  file for reading
    file_name = 'data' + os.sep + 'text_file2.txt'
    my_file = open(file_name, 'rb')
    
    # read the file as a list and close
    data = my_file.readlines()
    my_file.close()
    
    
    # show the list - notice the end of line character in each element
    print data
    
    
    
    
    ['123456789\r\n', '123456789!\r\n', '123456789*\r\n', '123456789-']
    
    
    
    In [29]:
    # loop through the list and print the lines - the EOL character will be printed as a 
    # new line, so in addition to the new line provided by the print statement we also 
    # get one from the string itself
    for line in data:
        print line
    
    
    
    
    123456789
    
    123456789!
    
    123456789*
    
    123456789-
    
    
    
    In [30]:
    # loop through the list and print the lines without the extra EOL character
    for line in data:
        print line.strip()
    
    
    
    
    123456789
    123456789!
    123456789*
    123456789-
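
Note that strip() removes all leading and trailing whitespace, not only the end-of-line characters. When leading spaces matter, rstrip('\r\n') may be a better fit. A small sketch, with a made-up list in which the first element has two leading spaces:

```python
# Hypothetical data: the first element has two leading spaces.
data = ['  123456789\r\n', '123456789!\r\n', '123456789-']

# strip() drops whitespace on both ends, including the leading spaces.
stripped = [line.strip() for line in data]

# rstrip('\r\n') drops only trailing carriage returns and newlines.
eol_only = [line.rstrip('\r\n') for line in data]

print(stripped)
print(eol_only)
```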
    
    
    
    In [31]:
    # write some data to a file - if the file does not exist then it gets created 
    # otherwise it gets overwritten
    out_file = 'data' + os.sep + 'out_file.txt'
    test_out = open(out_file, 'w')
    
    
    
    In [32]:
    # define a string
    str1 = 'my first line'
    str2 = 'my second line'
    
    # write the strings
    test_out.write(str1)
    test_out.write(str2)
    
    # close the file
    test_out.close()
    
    
    
    In [33]:
    # if you open the file you'd see that the lines are written back to back without 
    # an EOL character - to avoid this, append an EOL to the string
    out_file = 'data' + os.sep + 'out_file.txt'
    test_out = open(out_file, 'w')
    
    # define a string
    str1 = 'my first line\n'
    str2 = 'my second line'
    
    # write the strings
    test_out.write(str1)
    test_out.write(str2)
    
    # close the file
    test_out.close()
    
    
    
    In [34]:
    # another way to write the two lines
    out_file = 'data' + os.sep + 'out_file.txt'
    test_out = open(out_file, 'w')
    
    # define a string
    str1 = 'my first line'
    str2 = 'my second line'
    
    # write the strings
    test_out.write(str1 + '\n' +  str2)
    
    # close the file
    test_out.close()
    
    
    
    In [35]:
    # append a line to the file
    out_file = 'data' + os.sep + 'out_file.txt'
    test_out = open(out_file, 'a')
    
    # define a string
    str1 = '\nmy 3rd line\n'
    str2 = 'my 4th line'
    
    # write the strings
    test_out.write(str1 + str2)
    
    # close the file
    test_out.close()
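
The difference between the 'w' and 'a' modes used in the last few cells can be checked directly. The sketch below assumes Python 3 and writes to a scratch file in a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file for this sketch.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# 'w' creates the file if it does not exist.
with open(path, "w") as fout:
    fout.write("one\n")

# Opening with 'w' again discards the previous contents.
with open(path, "w") as fout:
    fout.write("two\n")

# 'a' keeps the existing contents and writes at the end.
with open(path, "a") as fout:
    fout.write("three\n")

with open(path, "r") as fin:
    contents = fin.read()

print(contents)
```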
    
    
    
    In [36]:
    # another way to open and parse a file is by using the "with" keyword. the 
    # advantage of this is that it handles file closing automatically.
    
    with open(out_file, 'rb') as fin:
        read_data = fin.readlines()
    
        
    # no closing the file as we did above
    print read_data
    
    for line in read_data:
        print line.strip().split()
    
    
    
    
    ['my first line\r\n', 'my second line\r\n', 'my 3rd line\r\n', 'my 4th line']
    ['my', 'first', 'line']
    ['my', 'second', 'line']
    ['my', '3rd', 'line']
    ['my', '4th', 'line']
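
The automatic closing done by "with" also applies when an error interrupts the block. Here is a minimal sketch, assuming Python 3; the scratch file and the simulated RuntimeError are made up for illustration. It also iterates over the file object directly, which is an alternative to readlines().

```python
import os
import tempfile

# Hypothetical scratch file for this sketch.
path = os.path.join(tempfile.mkdtemp(), "with_sketch.txt")

with open(path, "w") as fout:
    fout.write("my first line\nmy second line\n")

# Iterating over the file object yields one line at a time.
with open(path, "r") as fin:
    words = [line.split() for line in fin]

# The file is closed as soon as the block ends.
closed_after_block = fin.closed

# The file is also closed when the block is left because of an exception.
try:
    with open(path, "r") as fin2:
        raise RuntimeError("simulated failure while reading")
except RuntimeError:
    closed_after_error = fin2.closed

print(words, closed_after_block, closed_after_error)
```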
    

    CSV files

    CSV files are popular since they can be opened in a spreadsheet or a text editor. They are organized in a table format. Here we'll work with rectangular CSV files (no data missing). Later we'll work with general text files where some data is missing, where data and comments are mixed together, and where headers are part of the file to tell the user what kind of data is being processed.

    CSV file access functions

    • csv.reader
    • csv.writer
    • csv.DictReader
    • csv.DictWriter
    • csv.register_dialect
    • csv.unregister_dialect
    • csv.get_dialect
    • csv.list_dialects
    • csv.field_size_limit
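
The dialect functions in this list are not used in the cells that follow, so here is a brief sketch of how they fit together, assuming Python 3. The dialect name "pipes" and the sample rows are made up; io.StringIO stands in for a real file.

```python
import csv
import io

# "pipes" is a made-up dialect name: fields separated by '|'.
csv.register_dialect("pipes", delimiter="|", lineterminator="\n")
registered = "pipes" in csv.list_dialects()

# Write two rows using the dialect.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="pipes")
writer.writerow(["a", "b", "c"])
writer.writerow(["d", "e", "f"])
text = buffer.getvalue()

# Read the rows back using the same dialect.
rows = list(csv.reader(io.StringIO(text), dialect="pipes"))

# Remove the dialect from the registry again.
csv.unregister_dialect("pipes")
still_registered = "pipes" in csv.list_dialects()

print(text)
print(rows)
```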
    
    
    In [37]:
    # create a csv file the old fashioned way (just like another text file)
    
    
    # file name - look for it in the data folder
    import os
    csv_file = 'data' + os.sep + 'csv_file.csv'
    
    # define lines
    line1 = 'a,b,c,d,e\n'
    line2 = 'f,g,h,i,j\n'
    line3 = 'h,i,j,k,l'
    
    # write the data
    with open(csv_file, 'wb') as fout:
        fout.write(line1 + line2 + line3)
    
    
    
    In [38]:
    # let's read the file
    import csv
    
    with open(csv_file, 'rb') as fin:
        csv_reader = csv.reader(fin)
        
        for row in csv_reader:
            print row
    
    
    
    
    ['a', 'b', 'c', 'd', 'e']
    ['f', 'g', 'h', 'i', 'j']
    ['h', 'i', 'j', 'k', 'l']
    
    
    
    In [39]:
    # this shows the difference between reading the file using the previous methods
    # we learned above and the csv way
    with open(csv_file, 'rb') as fin:
        read_data = fin.readlines()
      
    # no closing the file as we did above
    print read_data
    
    
    
    
    ['a,b,c,d,e\n', 'f,g,h,i,j\n', 'h,i,j,k,l']
    
    
    
    In [40]:
    # as you can see, we got one list with each row as a string element. 
    # this is not what we want from a csv file.
    
    
    
    In [41]:
    # let's read the file and store the data in a list that persists after the file
    # is closed
    data_list = []
    with open(csv_file, 'rb') as fin:
        csv_reader = csv.reader(fin)
        
        for row in csv_reader:
            data_list.append(row)
    
    
    
    In [42]:
    # show the data
    print data_list
    
    
    
    
    [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j'], ['h', 'i', 'j', 'k', 'l']]
    
    
    
    In [43]:
    # here is what we have when we reverse the list - we're going to write 
    # this back to another file
    
    for row in reversed(data_list):
        print row
    
    
    
    
    ['h', 'i', 'j', 'k', 'l']
    ['f', 'g', 'h', 'i', 'j']
    ['a', 'b', 'c', 'd', 'e']
    
    
    
    In [44]:
    # let's write the data to a file in reverse order for the list and for each 
    # element in the list
    
    import csv
    import os
    
    csv_out_file = 'data' + os.sep + 'csv_out.csv'
    
    fout = open(csv_out_file, 'wb')
    
    csv_writer = csv.writer(fout)
    
    for row in reversed(data_list):
        csv_writer.writerow(row)
        
    fout.close()
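
The cells in this section open CSV files with 'wb' and 'rb', which is the Python 2 convention. In Python 3 the csv module expects text-mode files opened with newline=''. The sketch below shows that variant under the assumption that the code runs on Python 3; the temporary path and the shortened data_list are made up for illustration.

```python
import csv
import os
import tempfile

# Shortened, made-up version of the data_list used above.
data_list = [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j']]

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "csv_out.csv")

# Python 3: text mode plus newline='' lets the csv module control line endings.
with open(path, "w", newline="") as fout:
    csv.writer(fout).writerows(reversed(data_list))

with open(path, "r", newline="") as fin:
    rows = list(csv.reader(fin))

# Look at the raw bytes: the default csv dialect ends each row with '\r\n'.
with open(path, "rb") as fin:
    raw = fin.read()

print(rows)
print(raw)
```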
    
    
    
    In [45]:
    # here is another example
    # create another file with headers
    
    # file name - look for it in the data folder
    csv_file = 'data' + os.sep + 'csv_file2.csv'
    
    # for generating random numbers
    import random
    
    
    # define a random 2d array
    random_nums = [[str(random.random()) for _ in range(6)] for _ in range(6)]
    
    # generate a string with all the numbers and new line characters
    line = 'lon,lat,alt,roll,pitch,yaw\n'
    for row in random_nums:
        line += ','.join(row)+'\n'
    
    
    # strip the last EOL character 
    line = line.strip() # line = line[:-1] is another way of doing it   
    
    
    # write the file
    with open(csv_file, 'wb') as fout:
        fout.write(line)
    
    
    
    In [46]:
    # now that we have the file, we can read it in one swoop into a dictionary
    
    
    import csv
    
    # the csv file we want to read
    csv_file = 'data' + os.sep + 'csv_file2.csv'
    
    
    # read the file into a dictionary
    fin = open(csv_file, 'rb')
    csv_data = csv.DictReader(fin, delimiter=',')
    
    # let's see how the structure looks
    for line in csv_data:
        print line
        
    
        
    # check if the file is closed    
    print fin.closed
    
    # close it
    fin.close()
    
    # check again
    print fin.closed
    
    # let's see if we can access the csv_data again
    for line in csv_data:
        print line
    
    
    
    
    {'yaw': '0.528824226743', 'lon': '0.263092270025', 'pitch': '0.148076408617', 'lat': '0.662308453532', 'alt': '0.332629503006', 'roll': '0.1783775897'}
    {'yaw': '0.238819730976', 'lon': '0.867796935112', 'pitch': '0.56615998845', 'lat': '0.371069637451', 'alt': '0.849238153107', 'roll': '0.305663415603'}
    {'yaw': '0.690815139565', 'lon': '0.362820091896', 'pitch': '0.340039066539', 'lat': '0.181811986991', 'alt': '0.864088615179', 'roll': '0.65358821448'}
    {'yaw': '0.867859846349', 'lon': '0.0257943331321', 'pitch': '0.7771720543', 'lat': '0.224272358026', 'alt': '0.558781266787', 'roll': '0.553860900387'}
    {'yaw': '0.858457154515', 'lon': '0.319157113471', 'pitch': '0.839665099121', 'lat': '0.935834737188', 'alt': '0.485786356564', 'roll': '0.891366550168'}
    {'yaw': '0.923332637644', 'lon': '0.172463762615', 'pitch': '0.0590637085953', 'lat': '0.268639628313', 'alt': '0.591748494576', 'roll': '0.876590865835'}
    False
    True
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-46-c3fea55e0bea> in <module>()
         28 
         29 # let's see if we can access the csv_data again
    ---> 30 for line in csv_data:
         31     print line
    
    C:\Python27\lib\csv.pyc in next(self)
        102             # Used only for its side effect.
        103             self.fieldnames
    --> 104         row = self.reader.next()
        105         self.line_num = self.reader.line_num
        106 
    
    ValueError: I/O operation on closed file
    
    
    
    In [47]:
    # as you can see above, the DictReader parsed the file and organized the data 
    # into one dictionary per line, with the headers as keys and the data as values.
    # however, this data is live: the reader only pulls rows from the open file on
    # demand, so as soon as the file is closed we can no longer read from it.
    # it also means that once we have looped through the structure we can't access
    # the data again without creating a new reader
    # read the file into a dictionary
    fin = open(csv_file, 'rb')
    csv_data = csv.DictReader(fin, delimiter=',')
    
    # let's see how the structure looks
    for line in csv_data:
        print line
        
    # this will give us nothing
    for line in csv_data:
        print line['yaw']
    
        
    fin.close()
    
    
    
    
    {'yaw': '0.528824226743', 'lon': '0.263092270025', 'pitch': '0.148076408617', 'lat': '0.662308453532', 'alt': '0.332629503006', 'roll': '0.1783775897'}
    {'yaw': '0.238819730976', 'lon': '0.867796935112', 'pitch': '0.56615998845', 'lat': '0.371069637451', 'alt': '0.849238153107', 'roll': '0.305663415603'}
    {'yaw': '0.690815139565', 'lon': '0.362820091896', 'pitch': '0.340039066539', 'lat': '0.181811986991', 'alt': '0.864088615179', 'roll': '0.65358821448'}
    {'yaw': '0.867859846349', 'lon': '0.0257943331321', 'pitch': '0.7771720543', 'lat': '0.224272358026', 'alt': '0.558781266787', 'roll': '0.553860900387'}
    {'yaw': '0.858457154515', 'lon': '0.319157113471', 'pitch': '0.839665099121', 'lat': '0.935834737188', 'alt': '0.485786356564', 'roll': '0.891366550168'}
    {'yaw': '0.923332637644', 'lon': '0.172463762615', 'pitch': '0.0590637085953', 'lat': '0.268639628313', 'alt': '0.591748494576', 'roll': '0.876590865835'}
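
One way around the "live" behaviour shown above is to copy the rows into a list while the file is still open. This is a sketch only, assuming Python 3; io.StringIO stands in for the open file, and the three-column sample text is made up.

```python
import csv
import io

# Made-up sample with a header row; io.StringIO stands in for an open file.
text = "lon,lat,alt\n0.1,0.2,0.3\n0.4,0.5,0.6\n"
fin = io.StringIO(text)

reader = csv.DictReader(fin)

# list() consumes the reader once and keeps every row in memory.
rows = list(reader)

# The reader is now exhausted, so a second pass yields nothing.
second_pass = list(reader)

fin.close()

# The copied rows are still usable after the file is closed.
lat = [float(row["lat"]) for row in rows]

print(lat)
```

The trade-off is memory: the whole file is held in the list, which is fine for small files like the ones used here.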
    
    
    
    In [48]:
    # so to get the data by column, we do this
    fin = open(csv_file, 'rb')
    csv_data = csv.DictReader(fin, delimiter=',')
    
    # print the 'lat' column, one value per row
    for line in csv_data:
        print line['lat']
    
    # close the file
    fin.close()
    
    
    
    
    0.662308453532
    0.371069637451
    0.181811986991
    0.224272358026
    0.935834737188
    0.268639628313
    
    
    
    In [49]:
    # let's write a csv file with headers using DictWriter
    
    # from above we saw that DictReader yields one dictionary per row - so to
    # write a file we generate a list of dictionaries
    
    import os
    import csv
    
    # define the list
    rows = []
    rows.append({'name': 'mal', 'dob': 2468, 'role': 'captain'})
    rows.append({'name': 'zoe', 'dob': 2484, 'role': 'first mate'})
    rows.append({'name': 'wash', 'dob': 2468, 'role': 'pilot'})
    rows.append({'name': 'inara', 'dob': 2460, 'role': 'companion'})
    rows.append({'name': 'jayne', 'dob': 2463, 'role': 'mercenary'})
    
    # define the header
    header = ['name', 'dob', 'role']
    
    # open the file
    with open('data' + os.sep + 'csv_output_2.csv', 'wb') as fout:
        csv_writer = csv.DictWriter(fout, header)
        csv_writer.writeheader()
        csv_writer.writerows(rows)
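
To see what DictWriter actually produces, the output can be captured in memory instead of a file. The sketch below assumes Python 3 and uses io.StringIO in place of the output file. The third row deliberately omits 'dob' to show that a missing key is written as an empty field, which is the default behaviour of DictWriter.

```python
import csv
import io

header = ['name', 'dob', 'role']

rows = [
    {'name': 'mal', 'dob': 2468, 'role': 'captain'},
    {'name': 'zoe', 'dob': 2484, 'role': 'first mate'},
    {'name': 'wash', 'role': 'pilot'},  # 'dob' left out on purpose
]

# io.StringIO stands in for the output file in this sketch.
buffer = io.StringIO()
csv_writer = csv.DictWriter(buffer, header)
csv_writer.writeheader()
csv_writer.writerows(rows)

text = buffer.getvalue()
print(text)
```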
    
    
    