In [1]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import datetime as dt # date tools, used to note current date
%matplotlib inline
We have seen how to input data from csv and xls files -- either online or from our computer -- as well as through APIs. Sometimes the data is only available as a specific part of a website.
We want to access the source code of the website and systematically extract the relevant information.
Hypertext Markup Language (HTML) specifies the structure and main content of the site -- it tells the browser how to lay out the content. Think of Markdown.
It is structured using tags.
<html>
<head>
(Meta) Information about the page.
</head>
<body>
<p>
This is a paragraph.
</p>
<table>
This is a table
</table>
</body>
</html>
Tags determine the content and layout depending on their relation to other tags. Useful terminology:
- child -- a child is a tag inside another tag. The p tag above is a child of the body tag.
- parent -- a parent is the tag another tag is inside. The body tag above is a parent of the p tag.
- sibling -- a sibling is a tag that is nested inside the same parent as another tag. The head and body tags above are siblings.
There are many different tags -- take a look at a reference list. You won't and shouldn't remember all of them, but it's useful to have a rough idea about them.
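As a quick preview of how these relations are used in practice (a minimal sketch using the BeautifulSoup library, which is introduced in detail below; the snippet string is just the example HTML from above), we can parse the snippet and walk between tags:
# a minimal sketch: navigating parent/child/sibling relations with BeautifulSoup
from bs4 import BeautifulSoup

snippet = """
<html>
<head>(Meta) Information about the page.</head>
<body>
<p>This is a paragraph.</p>
<table>This is a table</table>
</body>
</html>
"""

soup = BeautifulSoup(snippet, 'html.parser')
p = soup.find('p')
print(p.parent.name)                                   # 'body' -- p is a child of body
print([c.name for c in soup.body.children if c.name])  # ['p', 'table'] -- the children of body
print(soup.head.find_next_sibling().name)              # 'body' -- head and body are siblings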
And take a look at a real example -- open a page, then right click and choose "View Page Source".
In the real example you will see that there is more information after the tag name, most commonly a class and an id. Something similar to the following:
<html>
<head class='main-head'>
(Meta) Information about the page.
</head>
<body>
<p class='inner-paragraph' id='001'>
This is a paragraph.
<a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<table class='inner-table' id='002'>
This is a table
</table>
</body>
</html>
The class and id information will help us locate the information we are looking for in a systematic way. (Originally, classes and ids are used by CSS to determine which HTML elements certain styles apply to.)
A useful way to explore the html and the corresponding website is to right click on the web page and then click on Inspect element -- this shows the browser's interpretation of the html.
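For example (again a minimal sketch with BeautifulSoup, introduced below, applied to the snippet above), classes and ids let us target a specific element directly instead of searching through all tags of a type:
# a minimal sketch: locating elements by class and by id
from bs4 import BeautifulSoup

snippet = ("<body><p class='inner-paragraph' id='001'>This is a paragraph.</p>"
           "<table class='inner-table' id='002'>This is a table</table></body>")

soup = BeautifulSoup(snippet, 'html.parser')
print(soup.find('p', class_='inner-paragraph').get_text())  # locate by class
print(soup.find(id='002').name)                             # locate by id -> 'table'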
Suppose we want to check prices for renting a room in Manhattan on Craigslist. Let's check, for example, the rooms & shares section for the East Village.
We have to download the content of the webpage -- i.e. get the contents structured by the HTML. This we can do with the requests library, which is a human-friendly HTTP (HyperText Transfer Protocol) library for Python. You can find the Quickstart documentation here.
In [2]:
import requests # you might have to install this
In [3]:
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'
cl = requests.get(url)
cl
Out[3]:
After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully.
A status_code of 200 means that the page downloaded successfully.
- a status code starting with a 2 generally indicates success
- a code starting with a 4 or a 5 indicates an error.
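As a defensive habit (a minimal sketch, not a cell from the original analysis; resp is a new name so it does not clash with cl above), we can check the code ourselves and let requests raise an error for failed downloads:
# a minimal sketch: check the status code and fail loudly on errors
resp = requests.get(url)
if resp.status_code != 200:
    print('Download failed with status code', resp.status_code)
resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses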
In [4]:
cl.status_code
Out[4]:
You might want to query for different things and download information for all of them. Instead of hard-coding the full url, we can pass the search parameters to requests.get as a dictionary (with keys and values).
In [5]:
url = 'https://newyork.craigslist.org/search/roo'
keys = {'query' : 'east village', 'availabilityMode' : '0'}
cl_extra = requests.get(url, params=keys)
In [6]:
# see if the URL was specified successfully
cl_extra.url
Out[6]:
Check tab completion
In [7]:
cl.url
Out[7]:
To print out the content of the html file, use the content or text properties.
This is going to be ugly and unreadable.
In [8]:
cl.text[:300]
Out[8]:
In [9]:
cl.content[:500] # this also works for content that is not purely text
Out[9]:
Now that we have the content of the web page we want to extract certain information. BeautifulSoup is a Python package which helps us do that. See the documentation for more information.
We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:
In [10]:
from bs4 import BeautifulSoup
In [11]:
BeautifulSoup?
In [12]:
cl_soup = BeautifulSoup(cl.content, 'html.parser')
Print this out in a prettier way.
In [13]:
#print(cl_soup.prettify())
In [14]:
print('Type:', type(cl_soup))
In [15]:
# we can access a tag
print('Title: ', cl_soup.title)
In [16]:
# or only the text content
print('Title: ', cl_soup.title.text) # or
print('Title: ', cl_soup.title.get_text())
We can find all tags of a certain type with the find_all method. This returns a list.
In [17]:
cl_soup.find_all?
To get the first paragraph in the html write
In [18]:
cl_soup.find_all('p')[0]
Out[18]:
This is a lot of information and we want to extract some part of it. Use the text
or get_text()
method to get the text content.
In [19]:
cl_soup.find_all('p')[0].get_text()
Out[19]:
This is still messy. We will need a smarter search.
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children
property of soup. For example here are the children of the first paragraph tag.
Note: children
returns a list iterator, so we need to call the list function on it.
In [20]:
list(cl_soup.find_all('p')[0].children)
Out[20]:
We can also look for tags based on their class. This is extremely useful for efficiently locating information.
In [21]:
cl_soup.find_all('span', class_='result-price')[0].get_text()
Out[21]:
In [22]:
cl_soup.find_all('span', class_='result-price')[:10]
Out[22]:
In [23]:
prices = cl_soup.find_all('span', class_='result-price')
In [24]:
price_data = [price.get_text() for price in prices]
In [25]:
price_data[:10]
Out[25]:
In [26]:
len(price_data)
Out[26]:
We are getting more entries than we want -- there were only 120 listings on the page. Check the ads with "Inspect Element": there are duplicates. We need a different tag level (<li>).
In [27]:
cl_soup.find_all('li', class_='result-row')[0]
Out[27]:
In [28]:
ads = cl_soup.find_all('li', class_='result-row')
In [29]:
# we can access values of the keys by using a dictionary like syntax
ads[5].find('a', class_='result-title hdrlnk')
Out[29]:
In [30]:
ads[5].find('a', class_='result-title hdrlnk')['href']
Out[30]:
In [31]:
data = [[ad.find('a', class_='result-title hdrlnk').get_text(),
ad.find('a', class_='result-title hdrlnk')['data-id'],
ad.find('span', class_='result-price').get_text()] for ad in ads ]
What's going wrong? Some ads don't have a price listed, so we can't retrieve it.
In [32]:
# if it exists then the type is
type(ads[0].find('span', class_='result-price'))
Out[32]:
If it does not find the price, it returns None (a NoneType object). We can exploit this fact to select only the valid listings.
In [33]:
import bs4
data = [[ad.find('a', class_='result-title hdrlnk').get_text(),
ad.find('a', class_='result-title hdrlnk')['data-id'],
ad.find('span', class_='result-price').get_text()] for ad in ads
if type(ad.find('span', class_='result-price'))==bs4.element.Tag]
In [34]:
data[:10]
Out[34]:
In [35]:
df = pd.DataFrame(data)
In [36]:
df.head(10)
Out[36]:
In [37]:
df.shape
Out[37]:
We only have 118 listings because 2 listings did not have a price.
In [38]:
df.columns = ['Title', 'ID', 'Price']
In [39]:
df.head()
Out[39]:
We could do text analysis and see which words are common in ads that have a relatively higher price.
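A minimal sketch of that idea (assuming the df built above, with Price stored as strings like '$1,500'; the cleaning step may need adjusting to the actual data):
# a minimal sketch: compare common title words in cheaper vs. pricier ads
from collections import Counter

df['price_num'] = (df['Price'].str.replace('$', '', regex=False)
                              .str.replace(',', '', regex=False)
                              .astype(float))
median_price = df['price_num'].median()

def common_words(titles, n=10):
    # count the most frequent (lower-cased) words across a set of titles
    words = ' '.join(titles).lower().split()
    return Counter(words).most_common(n)

print('Above median:', common_words(df.loc[df['price_num'] > median_price, 'Title']))
print('Below median:', common_words(df.loc[df['price_num'] <= median_price, 'Title']))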
This approach is not really efficient because it only gets the first page of the search results. At the top of the CL page we can see the total number of listings. In Inspect mode we can pick an element from the page and check how it is defined in the html -- this is useful for getting tags and classes efficiently.
For example, the total number of ads is a span tag with a 'totalcount' class.
In [40]:
cl_soup.find('span', class_='totalcount')
Out[40]:
If we start clicking on the 2nd and 3rd pages of the results, we can see that there is a structure in how they are defined:
First page:
https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0
Second page:
https://newyork.craigslist.org/search/roo?s=120&availabilityMode=0&query=east%20village
Third page:
https://newyork.craigslist.org/search/roo?s=240&availabilityMode=0&query=east%20village
The number after roo?s= in the url specifies the listing the results start from (not inclusive). In fact, if we modify it ourselves we can make the page start from the corresponding listing and then show the next 120 listings. Try it!
We can also define the first page by putting s=0& after roo? like this:
https://newyork.craigslist.org/search/roo?s=0&availabilityMode=0&query=east%20village
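Equivalently (a minimal sketch that reuses the params approach from above instead of formatting the url by hand), we can pass the offset as just another key:
# a minimal sketch: request the second page of results via the 's' offset parameter
base_url = 'https://newyork.craigslist.org/search/roo'
page2_params = {'query': 'east village', 'availabilityMode': '0', 's': '120'}
page2 = requests.get(base_url, params=page2_params)
print(page2.url)  # should point at listings 121-240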
In [41]:
# First we get the total number of listings in real time
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'
cl = requests.get(url)
cl_soup = BeautifulSoup(cl.content, 'html.parser')
total_count = int(cl_soup.find('span', class_='totalcount').get_text())
print(total_count)
We have the total number of listings with the given search specification. Breaking down the steps:
1) Specify the url of each page we want to scrape
2) For each page, scrape the data -- we will reuse the code we already have for one page
3) Save the data into one dataframe -- we can use the append method for DataFrames or the extend method for lists
In [42]:
# 1) Specify the url
for page in range(0, total_count, 120):
print('https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page))
In [43]:
# Next we write a loop to scrape all pages
df = pd.DataFrame({'Title' : [], 'ID' : [], 'Price' : []})
for page in range(0, total_count, 120):
url = 'https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page)
cl = requests.get(url)
cl_soup = BeautifulSoup(cl.content, 'html.parser')
ads = cl_soup.find_all('li', class_='result-row')
data = pd.DataFrame([[ad.find('a', class_='result-title hdrlnk').get_text(),
ad.find('a', class_='result-title hdrlnk')['data-id'],
ad.find('span', class_='result-price').get_text()] for ad in ads
if type(ad.find('span', class_='result-price'))==bs4.element.Tag],
columns=['Title', 'ID', 'Price'])
df = df.append(data, ignore_index=True)
In [44]:
df.head()
Out[44]:
In [45]:
# Do the same using the `extend` method
data = []
for page in range(0, total_count, 120):
url = 'https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page)
cl = requests.get(url)
cl_soup = BeautifulSoup(cl.content, 'html.parser')
ads = cl_soup.find_all('li', class_='result-row')
data_page = [[ad.find('a', class_='result-title hdrlnk').get_text(),
ad.find('a', class_='result-title hdrlnk')['data-id'],
ad.find('span', class_='result-price').get_text()] for ad in ads
if type(ad.find('span', class_='result-price'))==bs4.element.Tag]
data.extend(data_page)
df = pd.DataFrame(data, columns=['Title', 'ID', 'Price'])
In [46]:
df.head()
Out[46]:
In [47]:
df.shape
Out[47]:
In [48]:
df.tail()
Out[48]:
We have scraped all the listings from CL in the "rooms & shares" section for the East Village.
Suppose you have a couple of destinations in mind and you want to check the weather for each of them for this Friday. You want to get it from the National Weather Service.
These are the places I want to check (suppose there are many more and you want to automate it):
locations = ['Bozeman, Montana', 'White Sands National Monument', 'Stanford University, California']
It seems that the NWS is using latitude and longitude coordinates in its search,
e.g. for White Sands: http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997
It would be cool to pass these on as arguments.
After some Google fu (i.e. searching for "latitude and longitude of location python") we find a post by Chris Albon which describes exactly what we want:
"Geocoding (converting a physical address or location into latitude/longitude) and reverse geocoding (converting a lat/long to a physical address or location) [...] Python offers a number of packages to make the task incredibly easy [...] use pygeocoder, a wrapper for Google's geo-API, to both geocode and reverse geocode."
Install pygeocoder
through pip install pygeocoder
(from conda
only the OSX version is available).
In [50]:
from pygeocoder import Geocoder
In [51]:
# check for one of the locations how it's working
# some addresses might not be valid -- it goes through Google's API
loc = Geocoder.geocode('Bozeman, Montana')
loc.coordinates
Out[51]:
In [52]:
Geocoder.geocode('Stanford, California').coordinates
Out[52]:
We can check whether it's working fine at http://www.latlong.net/
In [53]:
locations = ['Bozeman, Montana', 'White Sands National Monument', 'Stanford University, California']
coordinates = [Geocoder.geocode(location).coordinates for location in locations]
In [54]:
coordinates
Out[54]:
In [55]:
for location, coordinate in zip(locations, coordinates):
print('The coordinates of {} are:'.format(location), coordinate)
Define a dictionary for the parameters we want to pass to the GET request to the NWS server.
In [56]:
keys = {}
for location, coordinate in zip(locations, coordinates):
keys[location] = {'lat' : coordinate[0], 'lon' : coordinate[1]}
In [57]:
keys
Out[57]:
Recall the format of the url associated with a particular location
http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997
In [58]:
keys[locations[0]]
Out[58]:
In [59]:
url = 'http://forecast.weather.gov/MapClick.php'
nws = requests.get(url, params=keys[locations[0]])
In [60]:
nws.status_code
Out[60]:
In [61]:
nws.url
Out[61]:
In [62]:
nws.content[:300]
Out[62]:
In [63]:
nws_soup = BeautifulSoup(nws.content, 'html.parser')
In [64]:
seven = nws_soup.find('div', id='seven-day-forecast-container')
In [65]:
seven.find(text='Friday')
Out[65]:
In [66]:
seven.find(text='Friday').parent
Out[66]:
In [67]:
seven.find(text='Friday').parent.parent
Out[67]:
In [68]:
seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()
Out[68]:
In [69]:
data = []
for location in locations:
nws = requests.get(url, params=keys[location])
nws_soup = BeautifulSoup(nws.content, 'html.parser')
seven = nws_soup.find('div', id='seven-day-forecast-container')
temp = seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()
data.append([location, temp])
In [70]:
df_weather = pd.DataFrame(data, columns=['Location', 'Friday weather'])
In [71]:
df_weather
Out[71]:
In [72]:
df_weather['high_temp'] = df_weather['Friday weather'].str.rsplit().str.get(1).astype(float)  # the forecast string looks like 'High: 75 °F'; the second token is the number
In [73]:
df_weather['high_temp'].std()
Out[73]: