03 Fetching files on an FTP server with ftplib

Many online data resources reside on FTP servers. While the urllib module will usually work with those servers, the ftplib module offers additional commands that give us more control when listing and retrieving entire batches of files.

Here we explore how this is done. We'll look at an example where we want to [potentially] download all of EPA's StreamCat data stored here:

ftp://newftp.epa.gov/EPADataCommons/ORD/NHDPlusLandscapeAttributes/StreamCat/States

These data include a vast collection of attributes for National Hydrography Dataset (NHDPlus) catchments. We will explore how to instruct Python to download just the Rhode Island data (again, because the files are small) from the StreamCat FTP server. However, we'll design this notebook so that it can easily be tweaked to download the data for any state.

The process for grabbing data from an FTP server this way is a tad more involved than in previous examples. It requires the following steps:

  • Creating a link to the ftp server
  • Logging in to the server
  • Navigating to the remote folder where the files we want are stored
  • Creating a list of the files we want
    • (unless it's just one file and we know its name...)
  • Iterating through this list and downloading each one individually

This last step has a few sub-steps, but we'll get to that below.


This page has a good description of the ftplib module:
http://www.pythonforbeginners.com/code-snippets-source-code/how-to-use-ftp-in-python

First, we'll do some boilerplate stuff. This includes importing the libraries (we'll need the os module too), setting some variables, and creating a place where our downloads will live.


In [ ]:
#import the modules
import os
import ftplib

In [ ]:
#Set variables for the address of the FTP server and the folder holding our data
ftpURL = "newftp.epa.gov"
ftpDirectory = "/EPADataCommons/ORD/NHDPlusLandscapeAttributes/StreamCat/States/"

In [ ]:
#Create the base output folder, if it doesn't exist
outFolder = "./StreamCat"
if not os.path.exists(outFolder): os.mkdir(outFolder)

In [ ]:
#Set the state as a variable, so it's easily changed
state = 'RI'

In [ ]:
#Make a subfolder for the state in the outFolder, to facilitate file management
stateFolder = outFolder + os.sep + state
if not os.path.exists(stateFolder): os.mkdir(stateFolder)
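
The two folder-creation cells above can also be folded into a single step. The following is only a sketch: the function name make_state_folder is made up for this example, and it relies on os.makedirs with exist_ok=True, which creates any missing parent folders and does nothing if the folder is already there.

```python
import os

def make_state_folder(base_folder, state):
    """Create base_folder/state (plus any missing parents) and return its path.

    make_state_folder is a hypothetical helper name, not part of the notebook.
    """
    state_folder = os.path.join(base_folder, state)
    # exist_ok=True means no error is raised if the folder already exists
    os.makedirs(state_folder, exist_ok=True)
    return state_folder
```

Calling make_state_folder("./StreamCat", "RI") would then stand in for both the outFolder and stateFolder cells.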

Now that we've defined the state and created some folders to hold the data, we'll finally dive into the ftplib module to access the server and grab our data.

The first step is to log into the server. Many public FTP servers are "anonymous FTP" servers, meaning they allow anyone to log in. However, proper etiquette dictates that you log in as "anonymous" and give your email address as the password, so they have a record of who is accessing their data.

Below, we log into the EPA ftp server, creating an 'ftp' object which is our programmatic link to the server. Once we link to it, we log in. You should change "user@duke.edu" to your email address.


In [ ]:
ftp = ftplib.FTP(ftpURL)
ftp.login("anonymous","user@duke.edu")

Now that we are logged in, we can ask the ftp object for information if we want. For example, the welcome attribute holds the message the server admin wants you to see when you connect to the server:


In [ ]:
print(ftp.welcome)

What we need to do is navigate to the folder holding the files we want, which we found by navigating the ftp server in a web browser - and stored in the ftpDirectory variable.

We get there using the ftp cwd (change working directory) command:


In [ ]:
ftp.cwd(ftpDirectory)

Now, let's get a list of the files in this folder, which is done with the nlst command.


In [ ]:
files = ftp.nlst()
len(files) #Over 2000 files here!

In [ ]:
#Create a list of just Rhode Island files, that is, ones that end in "_RI.zip"
RI_files = ftp.nlst("*_RI.zip")
len(RI_files) #49 files; that's better...
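
Whether a wildcard passed to nlst is expanded depends on the server, since ftplib simply forwards the argument. If a server ignores the pattern, one fallback is to fetch the full list and filter it locally. Here is a minimal sketch using the standard fnmatch module; the function name files_for_state and the sample filenames are made up for illustration.

```python
import fnmatch

def files_for_state(filenames, state):
    """Return only the names that end in _<state>.zip, e.g. '_RI.zip'."""
    return fnmatch.filter(filenames, "*_" + state + ".zip")

# Made-up sample names, standing in for the list returned by ftp.nlst()
sample = ["Dams_RI.zip", "Dams_NC.zip", "Elevation_RI.zip", "readme.txt"]
print(files_for_state(sample, "RI"))  # ['Dams_RI.zip', 'Elevation_RI.zip']
```

With the real listing, files_for_state(files, state) should give the same result as the wildcard call on a server that supports it.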

Now that we have a list of files, we could write a loop that iterates through each of these. However, for simplicity, let's just grab the first file in the list and download that.


In [ ]:
#First, let's get the file we want to download. We'll just get the first item in the list
f = RI_files[0] 
print(f)

This is where the ftplib module is much less intuitive. The ftp.retrbinary command initiates the download of a binary file, but it requires a "callback function", which is an action telling the ftplib module what to do with this stream of downloaded information. For our callback function, we'll be writing the binary contents to a file.

Don't worry too much about this syntax; just don't modify it (other than perhaps the output filename) and you'll be fine.
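
If you are curious what a callback actually does, the sketch below imitates the idea without touching a server. The function feed_blocks is a made-up stand-in, not part of ftplib: it reads data in blocks and hands each block to whatever function it is given, which is roughly how retrbinary treats its callback (its default block size is 8192 bytes).

```python
import io

def feed_blocks(source, callback, blocksize=8192):
    """Hypothetical stand-in for retrbinary: pass each block of data to callback."""
    while True:
        block = source.read(blocksize)
        if not block:
            break  # no more data
        callback(block)

# Any function that accepts a bytes object can serve as the callback.
received = []
feed_blocks(io.BytesIO(b"abcdefghij"), received.append, blocksize=4)
print(received)  # [b'abcd', b'efgh', b'ij']

# A file-like object's write method works the same way, block by block.
buffer = io.BytesIO()
feed_blocks(io.BytesIO(b"abcdefghij"), buffer.write, blocksize=4)
print(buffer.getvalue())  # b'abcdefghij'
```

This is why passing a file's write method as the callback ends up saving the whole download: each block is appended to the file as it arrives.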


In [ ]:
#First we need to create a filename to which the remote contents will be written
outFilename = stateFolder + os.sep + f
print(outFilename)

In [ ]:
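#Ok, here's the command that actually grabs the file and writes it locally
#(the "with" block makes sure the local file is closed once the download finishes)
with open(outFilename,'wb') as outFile:
    ftp.retrbinary("RETR " + f, outFile.write)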

In [ ]:
#Did it work?
os.listdir(stateFolder)

In [ ]:
#Close the ftp connection
ftp.close()

So what would it take to download the same file for, say, North Carolina? Not much!

How would you tweak this script to grab all the RI files?
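
One possible answer, offered as a sketch rather than the only way to do it: wrap the download step in a loop. The function name download_all is made up for this example, and it assumes the ftp object is still connected, logged in, and already in the right remote folder (so run it before closing the connection).

```python
import os

def download_all(ftp, filenames, folder):
    """Download each named file from the server's current directory into folder.

    download_all is a hypothetical helper; ftp is assumed to be a logged-in
    ftplib.FTP object whose working directory holds the listed files.
    """
    saved = []
    for name in filenames:
        # Use only the base name locally, in case the server returns paths
        out_path = os.path.join(folder, os.path.basename(name))
        with open(out_path, "wb") as out_file:
            ftp.retrbinary("RETR " + name, out_file.write)
        saved.append(out_path)
    return saved
```

In this notebook that would be download_all(ftp, RI_files, stateFolder). Switching to another state then only requires changing the state variable and building the file list from "*_" + state + ".zip".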