In [6]:
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)


Out[6]:
True

Dealing with ZIP files

Each ZIP file contains a CSV file and a fixed-width file. We only want the CSV files, which we will store in the RAW directory.

Let's get the variables for the EXTERNAL and RAW directories.


In [7]:
# Get the project folders that we are interested in
PROJECT_DIR = os.path.dirname(dotenv_path)
EXTERNAL_DATA_DIR = PROJECT_DIR + os.environ.get("EXTERNAL_DATA_DIR")
RAW_DATA_DIR = PROJECT_DIR + os.environ.get("RAW_DATA_DIR")

# Get the list of filenames
files=os.environ.get("FILES").split()

print("Project directory is  : {0}".format(PROJECT_DIR))
print("External directory is : {0}".format(EXTERNAL_DATA_DIR))
print("Raw data directory is : {0}".format(RAW_DATA_DIR))
print("Base names of files   : {0}".format(" ".join(files)))


Project directory is  : /home/gsentveld/lunch_and_learn
External directory is : /home/gsentveld/lunch_and_learn/data/external
Raw data directory is : /home/gsentveld/lunch_and_learn/data/raw
Base names of files   : fmlydisb funcdisb familyxx househld injpoiep personsx samadult samchild paradata cancerxx
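
The cell above assumes that every variable is defined in .env. If one is missing, os.environ.get() returns None and the string concatenation fails with a TypeError that does not say which variable was the problem. A minimal sketch of a more defensive lookup follows; require_env and DEMO_DATA_DIR are hypothetical names used only for illustration and are not part of python-dotenv or this project.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    value = os.environ.get(name)
    if value is None:
        raise KeyError(
            "Environment variable {0} is not set; check your .env file".format(name)
        )
    return value


# Demonstration with a throwaway variable set directly instead of via .env
os.environ["DEMO_DATA_DIR"] = "/data/demo"
demo_dir = require_env("DEMO_DATA_DIR")
print(demo_dir)
```

In the notebook, the same helper could wrap the lookups of EXTERNAL_DATA_DIR, RAW_DATA_DIR and FILES, so a missing entry is reported by name.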

zipfile package

While some Python packages that read files can handle compressed files directly, the zipfile package can deal with more complex ZIP files, such as archives with several members. Each archive we downloaded contains two files, and we just want the CSV.
File objects are a bit more complex than other data structures. Opening, reading from, and writing to them can all raise exceptions, for example when you lack the required permissions.
Access to a file goes through a file handle rather than directly. You need to close the handle once you are done; otherwise your program keeps the file open as far as the operating system is concerned, potentially blocking other programs from accessing it.
To deal with that, use the with zipfile.ZipFile() as zfile: construction. Once the program leaves that block, Python closes the archive for you, even if an exception was raised inside it. The same pattern applies to other resources that must be released, such as ordinary files and many database connections.
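
To illustrate the closing behaviour, here is a small sketch that builds a throwaway archive in memory, reads it inside a with block, and then shows that the archive can no longer be read once the block has ended. The member names demo.csv and demo.dat are made up for this example and do not come from the downloaded data.

```python
import io
import zipfile

# Build a small archive in memory so the example needs no files on disk.
# The member names are invented for this demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zfile:
    zfile.writestr("demo.csv", "a,b\n1,2\n")
    zfile.writestr("demo.dat", "0001 0002\n")

# Reopen for reading; the archive is closed when the with block ends,
# even if the body raises an exception.
with zipfile.ZipFile(buffer) as zfile:
    names = zfile.namelist()
    csv_text = zfile.read("demo.csv").decode("utf-8")

print(names)

# Reading from the archive after the block fails because it is closed.
try:
    zfile.read("demo.csv")
    closed = False
except ValueError:
    closed = True
print("Archive closed after the with block: {0}".format(closed))
```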


In [9]:
import zipfile

print("Extracting files to: {}".format(RAW_DATA_DIR))
for file in files:

    # Build the full zip filename in the EXTERNAL DATA DIR
    fn = os.path.join(EXTERNAL_DATA_DIR, file + '.zip')
    # and the name of the csv member inside that zip file
    member = file + '.csv'

    print("{0} extract {1}.".format(fn, member))

    # The with <> as <>: construction opens the archive and
    # closes its handle for you when the block ends.
    with zipfile.ZipFile(fn) as zfile:
        zfile.extract(member, path=RAW_DATA_DIR)


Extracting files to: /home/gsentveld/lunch_and_learn/data/raw
/home/gsentveld/lunch_and_learn/data/external/fmlydisb.zip extract fmlydisb.csv.
/home/gsentveld/lunch_and_learn/data/external/funcdisb.zip extract funcdisb.csv.
/home/gsentveld/lunch_and_learn/data/external/familyxx.zip extract familyxx.csv.
/home/gsentveld/lunch_and_learn/data/external/househld.zip extract househld.csv.
/home/gsentveld/lunch_and_learn/data/external/injpoiep.zip extract injpoiep.csv.
/home/gsentveld/lunch_and_learn/data/external/personsx.zip extract personsx.csv.
/home/gsentveld/lunch_and_learn/data/external/samadult.zip extract samadult.csv.
/home/gsentveld/lunch_and_learn/data/external/samchild.zip extract samchild.csv.
/home/gsentveld/lunch_and_learn/data/external/paradata.zip extract paradata.csv.
/home/gsentveld/lunch_and_learn/data/external/cancerxx.zip extract cancerxx.csv.
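
The loop above relies on each archive containing a member named exactly like the archive plus .csv; if the name differs (for example in letter case), extract() raises a KeyError. As a hedged alternative, the sketch below looks up the CSV member with namelist() before extracting. extract_csv is a hypothetical helper, and the demo archive is created in a temporary directory rather than taken from the real data.

```python
import os
import tempfile
import zipfile


def extract_csv(zip_path, target_dir):
    """Extract the single .csv member of zip_path into target_dir.

    Hypothetical helper for illustration; returns the extracted file's path.
    """
    with zipfile.ZipFile(zip_path) as zfile:
        csv_members = [n for n in zfile.namelist() if n.lower().endswith(".csv")]
        if len(csv_members) != 1:
            raise ValueError(
                "Expected one CSV in {0}, found {1}".format(zip_path, csv_members)
            )
        return zfile.extract(csv_members[0], path=target_dir)


# Demonstration on a throwaway archive in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, mode="w") as zfile:
        zfile.writestr("demo.csv", "a,b\n1,2\n")
        zfile.writestr("demo.dat", "0001 0002\n")

    raw_dir = os.path.join(tmp, "raw")
    extracted = extract_csv(zip_path, raw_dir)
    extracted_name = os.path.basename(extracted)
    extracted_exists = os.path.exists(extracted)
    print(extracted_name)
```

Failing loudly when there is not exactly one CSV is a design choice for this sketch; it avoids silently picking the wrong file from an archive with an unexpected layout.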