Web scraping

Web scraping is the process of capturing information from websites. Sometimes this is as simple as copying-and-pasting or downloading a file from the internet, which we have done many times. Sometimes, unfortunately, the information we want isn't so easily available. For those cases, we will search through an example website for a series of data sets we want to be able to capture in an automated way.

Here is a nice resource for this that we will use for parts of this topic.


In [ ]:
from bs4 import BeautifulSoup
import requests

1. Data

The data for today

Here is the example we'll be using for this topic: satellite data. A seminar speaker in Oceanography in fall 2015, Dr. Chuanmin Hu, shared information about some of his algorithms for satellite data processing. Satellite data give an incredible spatial scale of information but require quite a bit of clever processing to extract the data that researchers actually want to use. This becomes increasingly true as we collectively build more complex algorithms to tease out new perspectives, and to try to remove visual obstacles like clouds. Dr. Hu, in particular, has found a way to remove sun glint from some satellite data, which can be very useful if you care about a latitude where this tends to be a problem.

Dr. Hu's data is hosted online; we'll be using it today. Our goal is to select a data type and to then automate the process of downloading a year's worth of image files of that data.

Fetching data in general

First, some general notes from our resource listed above:

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on your search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.

Fetching our data in particular

Let's click around on the website. First we should navigate to the Satellite Data Products. We'll try North America > Mississippi River. Notice the web address now:

http://optics.marine.usf.edu/cgi-bin/optics_data?roi=MRIVER&current=1

but also note that it doesn't change when we click on other time tabs or dates.
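
The address ends in a query string: everything after the ? is a set of key/value GET parameters separated by &. As a minimal sketch (roi appears to select the region of interest; the meaning of current is our guess), here is how the requests module we imported above can build that same query string from a dictionary:


In [ ]:
base = 'http://optics.marine.usf.edu/cgi-bin/optics_data'
params = {'roi': 'MRIVER', 'current': 1}  # the key/value pairs from the query string above

# requests assembles the "?roi=MRIVER&current=1" query string for us
response = requests.get(base, params=params)
response.url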


Exercise

Check out November 1, 2016: there are a lot of neat, unobstructed data sets from several satellite passes at different times of this day. Explore the data. What happens when you click on an image?


Data locations

So we've seen now that any particular satellite data image has its own unique address, for example:

http://optics.marine.usf.edu/subscription/modis/MRIVER/2016/daily/096/A20160961915.QKM.MRIVER.PASS.L3D_RRC.RGB.png

and another to compare with:

http://optics.marine.usf.edu/subscription/modis/MRIVER/2016/daily/096/T20160961605.QKM.MRIVER.PASS.L3D.ERGB.png

We want to get all of the files like this, but we don't want to click on each one and save it by hand. In order to automate downloading this data, we need to find the unique address for each available image, loop over those addresses, and save the data. Pretty simple! The difficult part is deconstructing the webpage in order to automate this process.

Let's take apart this image web address. We see that the first part, http://optics.marine.usf.edu/subscription/, is consistent between the two and looks like just a base address. These two images are from different satellites (MODIS-A and MODIS-T, respectively), but both still have modis next in the address. We can guess that MRIVER is for Mississippi River, so it selects out this particular geographic region. Then we see the year, 2016, and start getting into what are probably details about this particular file.
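
To make that decomposition concrete, here is a minimal sketch that rebuilds the first image address from its apparent pieces. The names and interpretations below (in particular reading 096 as a day of the year) are our guesses from the pattern, not anything documented by the site:


In [ ]:
base = 'http://optics.marine.usf.edu/subscription/'
sensor = 'modis'    # both MODIS-A and MODIS-T files live under "modis"
region = 'MRIVER'   # the Mississippi River region of interest
year = '2016'
day = '096'         # appears to be the day of the year
fname = 'A20160961915.QKM.MRIVER.PASS.L3D_RRC.RGB.png'  # one particular pass and product

base + sensor + '/' + region + '/' + year + '/daily/' + day + '/' + fname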


Exercise

To navigate through and understand the file system, go to http://optics.marine.usf.edu/subscription/modis/MRIVER and click around the links. What does each level of links refer to? How do you navigate to a single image file? What are all the different image files? Which one do you actually want, from the large list of image files you find?

Out of this exercise, you should come away with a sample link to the particular kind of data you want to select automatically, and an understanding of all the pieces of the web address: which parts are consistent between different image addresses, and which parts are distinct and would need to be looped over to capture all of these files.
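
Once you have a sample link, it can also help to break the file name itself apart programmatically, so you can see which fields would change from image to image. This is just a sketch; the interpretation of each piece is our best guess from comparing file names:


In [ ]:
fname = 'A20160961915.QKM.MRIVER.PASS.L3D_RRC.RGB.png'  # a sample file name

pieces = fname.split('.')  # ['A20160961915', 'QKM', 'MRIVER', 'PASS', 'L3D_RRC', 'RGB', 'png']
stamp = pieces[0]          # satellite letter plus a date/time stamp

satellite = stamp[0]       # 'A' or 'T': MODIS-A or MODIS-T, we think
year = stamp[1:5]          # '2016'
day_of_year = stamp[5:8]   # '096'
pass_time = stamp[8:]      # '1915': looks like the time of the satellite pass

satellite, year, day_of_year, pass_time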


2. The webpage

Now that we better understand the makeup of the web addresses we'll be using, we need to look at how to take apart the web site in order to dynamically access all of these files. That means, using the bit of knowledge we just gained, we want to be able to mine the necessary data file locations from the website itself with as few things hard-coded in as possible. (The more things we hard-code in, the more likely it is that a minor change on the website will break our code.)

We'll be using requests and Beautiful Soup for this analysis. (For people who have done some of this before, note that Requests is now used preferentially over urllib and urllib2. More here.)

Accessing the website

We use Requests to access the website, starting with a particular year, 2016. We could instead not specify the year and read in all of the years' worth of files, but to limit the scope of our exercise, we'll just use 2016. Note that we have already investigated and seen that "daily" is the only subsequent option on the 2016 page, so we add that in here too.


In [ ]:
url = 'http://optics.marine.usf.edu/subscription/modis/MRIVER/2016/daily/'
# requests.get returns a "response" object from the website; .text gives the text content of that response
restext = requests.get(url).text
restext  # messy

In [ ]:
soup = BeautifulSoup(restext, "lxml")  # interprets the text from the website
soup  # more organized

Exercise

Figure out how to "View Source" on a website. You might need to google to find this for your browser, but usually you can just right-click and choose the option there. Compare the source for the .../2016/daily webpage with what we see above.


What we are seeing here is the makeup of the website, the html. If we look closely, we can pick out what we saw on the website: some header links and then links to each day of files. We want to harvest these links to where the files are stored so we don't have to collect them manually. To do this, we need to think about how to move through the rows of the file, just like at the beginning of this class when we worked on reading files line by line. We can use the findAll method of a Beautiful Soup object to pull out elements that have a certain tag.

For example, below we pull out the parts of this site that are level-1 headers (the h1 tag):


In [ ]:
soup.findAll('h1')

Exercise

Try searching the html for other tags. What do you find? We'd like to be able to separately access each link to a day of files. What tag would let us do that?



In [ ]:

Let's try searching for the tag 'a', which is used for links in html.


In [ ]:
soup.findAll('a')

Ok, so we may have found something here. The first few entries are links that seem to form a header at the top of the page; we can compare with other pages to see that this header is consistent. Let's use this to grab all the links we want, and just skip the first few entries that make up the header.

How can we access each of these sub-webpages using what we have here?


In [ ]:
row = soup.findAll('a')[5]  # grab the first row that contains a link we want
print('row:\n', row, '\n')

But we need this in the form of another web address:


In [ ]:
url + row.string  # .string gives just the text inside the tag

Now that we have an address, we can access it the same way we did for the first website:


In [ ]:
restext = requests.get(url + row.string).text  # access text on new website
soup_dir = BeautifulSoup(restext, "lxml")  # open up page for a day
soup_dir

We can again compare this with the website itself to help understand what we are seeing: another list of links.


Exercise

We already spent time earlier understanding what the different parts of the web address for an image file represent. So, find the sort of file that you want from the available options for this day. Which variable do you want to access? Which size image? Note that there is more than one time per day and more than one satellite data source.


We can now loop over the links on both the first page (the list of links to each day's page) and the second (the list of links to the files from a given day).


In [ ]:
# the name pattern for the image files we want; there might be more than one per day
fname = '.QKM.MRIVER.PASS.L3D_RRC.RGB.png'

image_locs = []
for row in soup.findAll('a')[5:20]:  # loop through each day, skipping the header; only part of Jan 2016 to save time

    restext = requests.get(url + row.string).text
    soup_dir = BeautifulSoup(restext, "lxml")  # open up the page for this day

    for File in soup_dir.findAll('a')[5:]:  # find all the files for this day, skipping the header

        if fname in File.string:  # check whether this link is the data we want
            image_locs.append(url + row.string + File.string)  # save the file address

Now we have all of the web addresses for the data files of this type for 2016:


In [ ]:
image_locs

Our next step would be to read in the data from these addresses, with something like the following:


In [ ]:
from PIL import Image
from io import BytesIO
import numpy as np

response = requests.get(image_locs[10])  # choose one of the files to show as an example
img = Image.open(BytesIO(response.content))
foo = np.asarray(img)

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

if np.ndim(foo) == 3:  # a true-color image: a 3D array of (rows, columns, RGB)
    plt.imshow(foo)
elif np.ndim(foo) == 2:  # a single-band image: a 2D array of values
    plt.pcolormesh(foo)

But, we will talk about image processing in a subsequent class.
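
If instead we just want to save these images to files on our own computer for later use, a minimal sketch of that download step could look like the following (the folder name mriver_2016 is just an example):


In [ ]:
import os

outdir = 'mriver_2016'  # example folder for the downloaded images
os.makedirs(outdir, exist_ok=True)

for loc in image_locs:
    fname = os.path.join(outdir, loc.split('/')[-1])  # keep the original file name
    if os.path.exists(fname):  # skip anything we already downloaded
        continue
    response = requests.get(loc)
    with open(fname, 'wb') as f:
        f.write(response.content)  # the raw bytes of the png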

3. Result

I did these steps previously in order to gather relevant satellite data for my research onto my own computer, and so I could remake the plots myself with better colormaps.

Here is my code for that effort. You will recognize a lot of it!

And here are the satellite images I made.