In [1]:
import requests
from bs4 import BeautifulSoup
In [2]:
# Grab the NYT's homepage
response = requests.get("http://nytimes.com")
doc = BeautifulSoup(response.text, "html.parser")
In [ ]:
# Snag all of the headlines (h3 tags with 'story-heading' class)
headlines = doc.find_all("h3", {'class': 'story-heading'})
# Getting the headline text out using list comprehensions
# is a lot more fun but I guess you just learned those
# like a day ago, so we'll go ahead and use a for loop.
# But for the curious:
# [headline.text.strip() for headline in headlines]
# Print the text of the headlines
for headline in headlines:
    print(headline.text.strip())
So the issue is that sometimes you need to submit forms on a web site. Why? Well, let's look at an example.
This example is going to come from Dan Nguyen's incredible Search, Script, Scrape, a collection of 101 scraping exercises.
The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient
Related URL: http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm
When you visit that URL, you're going to type in "Fentanyl," and select "Disc (Discontinued Drug Products)." Then you'll hit search.
Hooray, results! Now look at the URL.
http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm
Does anything about that URL say "Fentanyl" or "Discontinued Drug Products"? Nope! And if you straight up visit it (might need to open an Incognito window) you'll end up being redirected back to a different page.
This means requests.get just isn't going to cut it. If you tell requests to download that page it's going to get a whooole lot of uselessness.
Be my guest if you want to try it!
In [ ]:
requests.get("http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm").text
There are two kinds of forms, GET forms and POST forms (...this is 99% true).
GET forms
A GET form is one where you can see parameters in the URL. For example, if you searched for images of animals surfing on Bing you'd end up here:
http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2
It has a couple parameters - q and FORM. FORM is some sort of weird analytics thing that doesn't affect the page, but q is definitely the term you're searching for. With a GET form, the data you put into the form is kept in the URL.
Just for kicks, if we looked at the HTML for a GET form it might look like this:
<form method="GET" action="/search">
<input type="text" name="q">
</form>
It might leave the whole method part off, too - GET is the default.
A fun part about GET forms is that you can share the URL to share the results. If you don't believe me, visit http://www.bing.com/images/search?q=animals+surfing&FORM=HDRSC2 to see animals surfing.
GET is how most web pages work. You've used it every time you invoke the unholy powers of requests.get.
requests.get("https://api.spotify.com/v1/search?query=90s&offset=20&limit=20&type=playlist")
GET is nice. GET is easy. But GET is not all there is.
POST forms
The other kind of form is the POST form. POST forms are not friendly!
Unlike GET forms, you can't share the URL to get the same information. The parameters - the q for your search, for example - aren't in the URL, they're hidden in the actual request.
What this means is that every time you request something from a POST-based form, you have to pretend you filled out the form and clicked the button.
First we need to find out what parameters we're going to hunt down. To do this, first make your way to the form, then get prepared.
1) In Chrome, View > Developer > Developer Tools
2) Click the Network tab
3) Fill the form out, and submit it
4) Scroll up to the top of the Network pane and find the request for the page you're on (I'm at tempai.cfm)
5) Click it
6) Select Headers on the right
7) Scroll down until you see Form Data
Okay, that seemed like a lot of work, but I promise it was actually simple and easy and you're living life in a grand grand way. Two parameters are listed for the search we're doing:
Generic_Name:Fentanyl
table1:OB_Disc
Seems simple enough! Now let's put them to work.
POST forms with requests.post
This is going to be so easy you might have a heart attack as a result of your body being so amazed that it doesn't have to do anything strenuous. All you have to do is
requests.get("http://whatever.com/url/to/something", { "param1": "val1", "param2": "val2" })
and treat it like a normal response! Here, I'll prove it.
In [ ]:
# Just in case you didn't run it up there, I'll import again
import requests
from bs4 import BeautifulSoup
In [ ]:
# Send a POST with the form data we found in the Network tab
response = requests.post("http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm",
                         data={"Generic_Name": "Fentanyl", "table1": "OB_Disc"})
doc = BeautifulSoup(response.text, "html.parser")
# Using .select instead of .find is a little more
# readable to people from the web dev world, maybe?
# (the selector here is a guess at the results page's markup)
doc.select("table tr")
It's magic, I swear!
Sometimes requests.get just isn't enough. Why? It mostly has to do with JavaScript or complicated forms - when a site reacts and changes without loading a new page, you can't use requests for that (think "Load more" buttons on Instagram).
For those sites you need Selenium! Selenium = you put your browser on autopilot. As in, literally, it takes control over your browser. There are "headless" versions that use invisible browsers, but if you don't want to install a bunch of stuff, the normal version is usually fine.
Selenium isn't just a Python package, so you'll need to install the Python bindings in order to have Python talk to Selenium.
pip install selenium
You'll also need the Firefox browser, since that's the browser we're going to be controlling.
Selenium is built on WebDrivers, which are libraries that let you... drive a browser. I believe it comes with a Firefox WebDriver, whereas Safari/Chrome/etc take a little more effort to set up.
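If you do want Chrome instead, the pattern is the same once its driver is set up - a minimal sketch, assuming the ChromeDriver binary is already installed and on your PATH:

from selenium import webdriver

# Only works if chromedriver is installed and on your PATH
driver = webdriver.Chrome()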
In [ ]:
# Imports, of course
from selenium import webdriver
from selenium.webdriver.support.ui import Select
In [ ]:
# Initialize a Firefox webdriver
driver = webdriver.Firefox()
In [ ]:
# Grab the web page
# (the original lookup URL isn't preserved here - use the search page you're scraping)
driver.get("http://example.com/license-lookup")
In [ ]:
# You'll use selenium.webdriver.support.ui.Select
# that we imported above to grab the Select element called
# t_web_lookup__license_type_name, then select Acupuncturists
# We use .find_element_by_name here because we know the name
dropdown = Select(driver.find_element_by_name("t_web_lookup__license_type_name"))
dropdown.select_by_visible_text("Acupuncturist")  # match the option's exact text on the page
In [ ]:
# We use .find_element_by_id here because we know the id
# Then we'll fake typing into it
# (the actual id isn't preserved here - grab it from the page source)
text_input = driver.find_element_by_id("the-input-id")
text_input.send_keys("your search term")
In [ ]:
# Now we can grab the search button and click it
# (the button's name is a placeholder, too - check the page for the real one)
driver.find_element_by_name("the-button-name").click()
In [ ]:
# Instead of using requests.get, we just look at .page_source of the driver
html = driver.page_source
In [ ]:
# We can feed that into Beautiful Soup
doc = BeautifulSoup(html, "html.parser")
In [ ]:
# It's a tricky table, but this grabs the linked names inside of the A
rows = doc.select("#datagrid_results tr")
# Build a list of dictionaries, one per row that actually has a linked name
data = [{"name": row.select_one("a").text.strip()} for row in rows if row.select_one("a")]
data
In [ ]:
# Close the webdriver
driver.close()
Now what are we going to do with our list of dictionaries? We could use a csv.DictWriter like in this post, but it's actually quicker to do it with pandas.
In [ ]:
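# A minimal sketch: pandas turns our list of dictionaries into a DataFrame,
# which can save itself straight to CSV (the filename is just an example)
import pandas as pd

df = pd.DataFrame(data)
df.to_csv("results.csv", index=False)
df.head()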