In [1]:
import requests
from bs4 import BeautifulSoup

Normal scraping

By now we all know how to scrape normal sites (kind of, mostly, somewhat).


In [2]:
# Grab the NYT's homepage
response = requests.get("http://nytimes.com")
doc = BeautifulSoup(response.text, "html.parser")



In [ ]:
# Snag all of the headlines (h3 tags with 'story-heading' class)
headlines = doc.find_all("h3", {'class': 'story-heading'})
# Getting the headline text out using list comprehensions
# is a lot more fun but I guess you just learned those
# like a day ago, so we'll go ahead and use a for loop.
# But for the curious:
#   [headline.text.strip() for headline in headlines]

# Print the text of the headlines
for headline in headlines:
    print(headline.text.strip())

But... forms!

So the issue is that sometimes you need to submit forms on a web site. Why? Well, let's look at an example.

This example is going to come from Dan Nguyen's incredible Search, Script, Scrape: 101 scraping exercises.

The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient

Related URL: http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm

When you visit that URL, you're going to type in "Fentanyl," and select "Disc (Discontinued Drug Products)." Then you'll hit search.

Hooray, results! Now look at the URL.

http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm

Does anything about that URL say "Fentanyl" or "Discontinued Drug Products"? Nope! And if you straight up visit it (might need to open an Incognito window) you'll end up being redirected back to a different page.

This means requests.get just isn't going to cut it. If you tell requests to download that page it's going to get a whooole lot of uselessness.

Be my guest if you want to try it!


In [ ]:
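# If you try it, requests gets bounced to a useless page instead of Fentanyl results
response = requests.get("http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm")
print(response.text[:500])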

Submitting forms with requests

There are two kinds of forms, GET forms and POST forms (...this is 99% true).

GET forms

A GET form is one where you can see parameters in the URL. For example, if you searched for images of animals surfing on Bing you'd end up here:

http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2

It has a couple of parameters - q and FORM. FORM is some sort of weird analytics thing that doesn't affect the page, but q is definitely the term you're searching for. With a GET form, the data you put into the form is kept in the URL.

Just for kicks, if we looked at the HTML for a GET form it might look like this:

<form method="GET" action="/search">
<input type="text" name="q">
</form>

It might also leave the whole method part off - GET is the default.

A fun part about GET forms is that you can share the URL to share the results. If you don't believe me, visit http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2 to see animals surfing.

GET is how most web pages work. You've used it every time you invoke the unholy powers of requests.get.

requests.get("https://api.spotify.com/v1/search?query=90s&offset=20&limit=20&type=playlist")

GET is nice. GET is easy. But GET is not all there is.

POST forms

The other kind of form is the POST form. POST forms are not friendly!

Unlike GET forms, you can't share the URL to get the same information. The parameters - the q for your search, for example - aren't in the URL, they're hidden in the actual request.

What this means is that every time you request something from a POST-based form, you have to pretend you filled out the form and clicked the button.

Grabbing the parameters

First we need to find out what parameters we're going to hunt down. To do this, first make your way to the form, then get prepared.

1) In Chrome, View > Developer > Developer Tools
2) Click the Network tab
3) Fill the form out, and submit it
4) Scroll up to the top of the Network pane, and find the segment of the URL you're at (I'm at tempai.cfm)
5) Click it
6) Select Headers on the right
7) Scroll down until you see Form Data

Okay, that seemed like a lot of work, but I promise it was actually simple and easy and you're living life in a grand grand way. Two parameters are listed for the search we're doing:

Generic_Name:Fentanyl
table1:OB_Disc

Seems simple enough! Now let's put them to work.

Submitting POST forms with requests.post

This is going to be so easy you might have a heart attack as a result of your body being so amazed that it doesn't have to do anything strenuous. All you have to do is

requests.get("http://whatever.com/url/to/something", { "param1": "val1", "param2": "val2" })

and treat it like a normal response! Here, I'll prove it.


In [ ]:
# Just in case you didn't run it up there, I'll import again
import requests

In [ ]:
# Using .select instead of .find is a little more
# readable to people from the web dev world, maybe?
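# A sketch: POST to the URL we found in the Network tab, sending the two
# form parameters from above. The CSS selector here is a guess, so inspect
# the results page and adjust it to match the actual table.
from bs4 import BeautifulSoup

response = requests.post("http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm",
                         data={"Generic_Name": "Fentanyl", "table1": "OB_Disc"})
doc = BeautifulSoup(response.text, "html.parser")
print(len(doc.select("tr")))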

It's magic, I swear!

But then...

Sometimes requests.get just isn't enough. Why? It mostly has to do with JavaScript or complicated forms - when a site reacts and changes without loading a new page, you can't use requests for that (think "Load more" buttons on Instagram).

For those sites you need Selenium! Selenium = you put your browser on autopilot. As in, literally, it takes control over your browser. There are "headless" versions that use invisible browsers but if you don't like to install a bunch of stuff, the normal version is usually fine.

Installing Selenium

Selenium isn't just a Python package, so you'll need to install the Python bindings in order to have Python talk to Selenium.

pip install selenium

You'll also need the Firefox browser, since that's the browser we're going to be controlling.

Selenium is built on WebDrivers, which are libraries that let you... drive a browser. I believe it comes with a Firefox WebDriver, whereas Safari/Chrome/etc take a little more effort to set up.

Using Selenium


In [ ]:
# Imports, of course
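from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup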

In [ ]:
# Initialize a Firefox webdriver
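driver = webdriver.Firefox()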

In [ ]:
# Grab the web page
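# The search form's URL isn't listed above -- this is a placeholder, so put
# the address of the license-lookup page you're actually scraping here
driver.get("http://www.example.com/license-lookup")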

In [ ]:
# You'll use selenium.webdriver.support.ui.Select
# that we imported above to grab the Select element called 
# t_web_lookup__license_type_name, then select Acupuncturists

# We use .find_element_by_name here because we know the name
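# "Acupuncturist" is assumed here -- check the dropdown for the exact option text
select = Select(driver.find_element_by_name("t_web_lookup__license_type_name"))
select.select_by_visible_text("Acupuncturist")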

In [ ]:
# We use .find_element_by_id here because we know the id

# Then we'll fake typing into it
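# The id here is hypothetical -- inspect the page to find the real one
text_input = driver.find_element_by_id("t_web_lookup__last_name")
# Fake typing a (made-up) search term into it
text_input.send_keys("Smith")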

In [ ]:
# Now we can grab the search button and click it
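# "sch_button" is a guess at the button's id -- check the page's HTML
search_button = driver.find_element_by_id("sch_button")
search_button.click()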

In [ ]:
# Instead of using requests.get, we just look at .page_source of the driver
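html = driver.page_source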

In [ ]:
# We can feed that into Beautiful Soup
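doc = BeautifulSoup(html, "html.parser")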

In [ ]:
# It's a tricky table, but this grabs the linked names inside the <a> tags
#rows = doc.select("#datagrid_results tr")
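# A sketch, assuming each result's name sits in an <a> tag inside the
# #datagrid_results table -- adjust the selector to the table's real layout
results = []
for link in doc.select("#datagrid_results tr a"):
    results.append({'name': link.text.strip()})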

In [ ]:

Closing the webdriver

Once we have all the data we want, we can close our webdriver.


In [ ]:
# Close the webdriver
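driver.close()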

Saving our data

Now what are we going to do with our list of dictionaries? We could use a csv.DictWriter like in this post, but it's actually quicker to do it with pandas.

Step One: import pandas


In [ ]:
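import pandas as pd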

Step Two: Turn list into a DataFrame


In [ ]:
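# 'results' is the (hypothetical) list of dictionaries we built from the table above
df = pd.DataFrame(results)
df.head()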

Step Three: Save it to a CSV

While you're saving it, set index=False or else it will include 0, 1, 2, etc. from the far-left column (the index, of course).


In [ ]:
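# "licenses.csv" is just an example filename
df.to_csv("licenses.csv", index=False)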

Step Four: Party down

I don't have directions for this one


In [ ]: