So far, we've covered ways of retrieving data that are already formatted for programs to consume. The term 'web scraping' refers to the messier business of pulling material from web sites that were really meant for people, not for computers. Web sites, of course, can include a variety of objects: text, images, video, Flash, etc., and your success at scraping what you want will vary. In other words, scraping involves a bit of MacGyvering.
Useful packages for scraping are requests and bs4/BeautifulSoup; the code below imports both and installs them if necessary.
We'll run through a few quick examples, but for more on this topic, I recommend:
In [ ]:
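# Import the requests package; install if necessary
try:
    import requests
except ImportError:
    # pip.main() was removed in pip 10, so call pip through the current interpreter
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'requests'])
    import requests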
In [ ]:
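# Import BeautifulSoup from the bs4 package; install beautifulsoup4 (which provides bs4) if necessary
try:
    from bs4 import BeautifulSoup
except ImportError:
    # pip.main() was removed in pip 10, so call pip through the current interpreter
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'beautifulsoup4'])
    from bs4 import BeautifulSoup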
In [ ]:
# Import re, a package for using regular expressions
import re
The requests package works a lot like the urllib package in that it sends a request to a server and stores the server's response in a variable, here named response.
In [ ]:
# Send a request to a web page
response = requests.get('https://xkcd.com/869')
In [ ]:
# The response object's text attribute holds the contents of the web page at the address provided
print(response.text)
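Requests can fail or hang, so it can help to wrap the call in a small helper. The sketch below is one possible way to do that, not part of the original notebook: the function name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML of a page as a string (hypothetical helper)."""
    # Send the request; give up if the server does not answer in time
    response = requests.get(url, timeout=timeout)
    # Raise an exception for HTTP error codes such as 404 or 500
    response.raise_for_status()
    return response.text

# Example usage (requires network access):
# html = fetch_html('https://xkcd.com/869')
# print(html[:200])
```

Checking the status before using response.text avoids silently parsing an error page as if it were the page you wanted.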
BeautifulSoup is designed to intelligently read raw HTML code, i.e., what is stored in response.text from the request above. The command below reads in the raw HTML and parses it into logical components that we can navigate and search.
The lxml in the command specifies a particular parser for deconstructing the HTML; note that lxml is a separate package that must be installed for this to work...
In [ ]:
# Parse the raw HTML into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')
type(soup)
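To give a feel for what "logical components" means, here is a minimal sketch that parses a small made-up HTML string rather than a live page. The HTML, the example.com URLs, and the variable names are all assumptions for illustration; it also uses Python's built-in 'html.parser' instead of lxml so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# A small, made-up page standing in for response.text
html = """
<html><body>
  <h1>Sample page</h1>
  <p>Some text with a <a href="https://example.com/about">link</a>.</p>
  <img src="https://example.com/images/comic.png" alt="A comic">
</body></html>
"""

sample_soup = BeautifulSoup(html, 'html.parser')

# Tags can be reached by name, and their text extracted
heading = sample_soup.h1.get_text()

# find_all() returns every matching tag; attributes are read like dict keys
links = [a['href'] for a in sample_soup.find_all('a')]
images = [img['src'] for img in sample_soup.find_all('img')]

print(heading)
print(links)
print(images)
```

Reading the src attribute of img tags is an alternative to searching the page text when the image address appears only in the markup.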
Here we search the text of the web page's body for any instance of https://....png, that is, any address of a PNG image that appears in the page's text. This is done with re, Python's regular expression module (see https://developers.google.com/edu/python/regular-expressions for more info on this useful module...).
The match object returned by search() holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs. If nothing matches, search() returns None instead. The group() method of the match returns the full string that was matched.
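The pieces of a match object described above can be seen in a short sketch. The sample string, the example.com address, and the pattern used here are assumptions made up for illustration, not taken from a real page.

```python
import re

# A made-up line of page text containing one PNG address
sample_text = ("Image URL (for hotlinking/embedding): "
               "https://example.com/comics/sample.png and more text")

# \S+? matches non-whitespace characters, as few as possible, up to '.png'
pattern = r'https://\S+?\.png'
sample_match = re.search(pattern, sample_text)

print(sample_match.group())      # the text that matched
print(sample_match.span())       # (start, end) positions within the input
print(sample_match.string)       # the original input string
print(sample_match.re.pattern)   # the regular expression that was used

# search() returns None when the pattern does not occur
no_match = re.search(pattern, "no images here")
print(no_match)
```

Because search() can return None, checking the result before calling group() avoids an AttributeError on pages that contain no PNG address.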
In [ ]:
#Search the page text for addresses of PNG files
#(raw string avoids invalid-escape warnings; \S+? stops the match at the first '.png')
match = re.search(r'https://\S+?\.png', soup.body.text)
In [ ]:
#What was found in the search
print(match.group())
In [ ]:
#And here is some Jupyter code to display the picture found by the search
from IPython.display import Image
Image(url=match.group())