In this appendix lecture, we'll go over how to scrape information from the web using Python. Before we get started, there are a few things to keep in mind:
1.) Check a site's terms and conditions before you scrape it.
2.) Space out your requests so you don't overload the site's server; hammering a site with requests can get you blocked (a short sketch of this follows the list).
3.) Scrapers break over time. Web pages change their layout all the time, so you'll more than likely have to rewrite your code.
4.) Web pages are usually inconsistent, so more than likely you'll have to clean up the data after scraping it.
5.) Every web page and situation is different, so you'll have to spend time configuring your scraper.
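As a minimal sketch of point 2 (the URLs and the two-second delay below are just placeholders, not part of this lecture's example), spacing out requests can be as simple as sleeping between them:

import time
import requests
# Hypothetical list of pages to scrape -- replace with real URLs
urls = ['http://example.com/page-1', 'http://example.com/page-2']
for url in urls:
    result = requests.get(url)
    # ... process result.content here ...
    # Pause so we don't hammer the server with back-to-back requests
    time.sleep(2)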
We'll need a few libraries for this lecture:
1.) BeautifulSoup, which you can install by typing pip install beautifulsoup4 or conda install beautifulsoup4 (for the Anaconda distribution of Python) in your command prompt.
2.) lxml, which you can install by typing pip install lxml or conda install lxml (for the Anaconda distribution of Python) in your command prompt.
3.) requests, which you can install by typing pip install requests or conda install requests (for the Anaconda distribution of Python) in your command prompt.
We'll start with our imports:
In [1]:
from bs4 import BeautifulSoup
import requests
In [2]:
import pandas as pd
from pandas import Series,DataFrame
For our quick web scraping tutorial, we'll look at some legislative reports from the University of California's web page. Feel free to experiment with other web pages, but remember to be cautious and respectful in what you scrape and how often you do it. Always check the legality of a web scraping job before you start.
Let's go ahead and set the url.
In [3]:
url = 'http://www.ucop.edu/operating-budget/budgets-and-reports/legislative-reports/2013-14-legislative-session.html'
Now let's go ahead and use requests to grab the content from the URL and set it up as a Beautiful Soup object.
In [5]:
# Request content from web page
result = requests.get(url)
c = result.content
# Set as Beautiful Soup object, using the lxml parser
soup = BeautifulSoup(c, 'lxml')
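As a quick sanity check (this step isn't in the original notebook), you can confirm the request actually succeeded before handing the content to Beautiful Soup:

# Raise an exception if the page could not be fetched (e.g. a 403 or 404 error)
result.raise_for_status()
# A status code of 200 means the page was retrieved successfully
print(result.status_code)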
Now we'll use Beautiful Soup to search for the table we want to grab!
In [6]:
# Go to the section of interest
summary = soup.find("div",{'class':'list-land','id':'content'})
# Find the tables in the HTML
tables = summary.find_all('table')
Now we need to use Beautiful Soup to find the table entries. A 'td' tag defines a standard cell in an HTML table. The 'tr' tag defines a row in an HTML table.
We'll parse through our tables object and try to find each cell using the find_all('td') method.
There are tons of options to use with find_all in Beautiful Soup. You can read about them in the Beautiful Soup documentation.
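To make the 'tr'/'td' structure concrete, here is a small self-contained sketch on a made-up HTML snippet (not the UC page) showing how find_all walks a table:

from bs4 import BeautifulSoup
# A made-up, minimal HTML table
html_doc = "<table><tr><td>2013-14</td><td>report.pdf</td></tr><tr><td>2014-15</td><td>update.pdf</td></tr></table>"
mini_soup = BeautifulSoup(html_doc, 'lxml')
for tr in mini_soup.find_all('tr'):
    # Each 'tr' row holds a list of 'td' cells
    print([td.get_text() for td in tr.find_all('td')])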
In [7]:
# Set up empty data list
data = []
# Set rows as first indexed object in tables with a row
rows = tables[0].find_all('tr')
# Now grab every HTML cell in every row
for tr in rows:
    cols = tr.find_all('td')
    # Grab the text from each cell in the row
    for td in cols:
        text = td.find(text=True)
        print(text)
        data.append(text)
Let's see what the data list looks like
In [8]:
data
Out[8]:
Now we'll use a for loop to go through the list and grab only the cells with a PDF file in them. We'll also need to keep track of the index so we can pair each report with its date.
In [9]:
# Set up empty lists
reports = []
date = []
# Set index counter
index = 0
# Go find the PDF cells
for item in data:
    if 'pdf' in item:
        # The cell just before the PDF link holds the date
        date.append(data[index-1])
        # Get rid of the non-breaking space (\xa0)
        reports.append(item.replace(u'\xa0', u' '))
    index += 1
You'll notice a line to take care of '\xa0'. This is a non-breaking space character that shows up in the scraped text and causes a Unicode error if you don't handle it. Web pages can be messy and inconsistent, and it is very likely you'll have to do some research to take care of problems like these.
Here's the StackOverflow page I used to solve this particular issue.
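As an aside (this isn't what the notebook does), Python's unicodedata module can normalize characters like the non-breaking space in a single call:

import unicodedata
# NFKC normalization turns u'\xa0' (a non-breaking space) into a regular space
cleaned = unicodedata.normalize('NFKC', u'Report\xa0Name')
print(cleaned)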
Now all that is left is to organize our data into a pandas DataFrame!
In [10]:
# Set up Dates and Reports as Series
date = Series(date)
reports = Series(reports)
In [11]:
# Concatenate into a DataFrame
legislative_df = pd.concat([date,reports],axis=1)
In [12]:
# Set up the columns
legislative_df.columns = ['Date','Reports']
In [13]:
# Show the finished DataFrame
legislative_df
Out[13]:
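As a possible next step (not part of the original notebook), you could save the finished DataFrame to disk, for example as a CSV file:

# Write the scraped table to a CSV file; the filename here is just an example
legislative_df.to_csv('legislative_reports.csv', index=False)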
There are also less hands-on options for web scraping; several companies offer hosted, point-and-click scraping services worth checking out. Below is a lighter-weight example using lxml and XPath instead of Beautiful Soup.
In [1]:
# http://docs.python-guide.org/en/latest/scenarios/scrape/
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# Inspecting the page's elements shows entries like these:
# <div title="buyer-name">Carson Busses</div>
# <span class="item-price">$29.95</span>
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
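If you wanted to take this a step further (this part is not in the original example), the two lists line up by position and can be combined into a pandas DataFrame:

import pandas as pd
# Pair each buyer with the corresponding price (assumes the lists are the same length)
econ_df = pd.DataFrame({'Buyer': buyers, 'Price': prices})
print(econ_df.head())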
In [ ]:
# https://www.flightradar24.com/56.16,-52.58/7
# http://stackoverflow.com/questions/39489168/how-to-scrape-real-time-streaming-data-with-python
# If you look at the network tab in the developer console in Chrome (for example), you'll see the requests to https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=59.09,52.64,-58.77,-47.71&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1
import requests
from bs4 import BeautifulSoup
import time
def get_count():
    url = "https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=57.78,54.11,-56.40,-48.75&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1"
    # Request with a fake header, otherwise you will get a 403 HTTP error
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # Parse the JSON
    data = r.json()
    counter = 0
    # Iterate over the elements to get the total number of flights
    for element in data["stats"]["total"]:
        counter += data["stats"]["total"][element]
    return counter

# This loop polls the feed indefinitely; interrupt the kernel to stop it
while True:
    print(get_count())
    time.sleep(8)