Web scraping

Date: 28 March 2017

@author: Daniel Csaba

Preliminaries

Import usual packages.


In [1]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  

%matplotlib inline

We have seen how to read data from csv and xls files -- either online or from our computer -- and through APIs. Sometimes, however, the data is only available as part of a web page.

We want to access the source code of the website and systematically extract the relevant information.

Again, use your Google fu to find useful references on HTML and web scraping.

Structure of web pages (very simplistic)

Hypertext Markup Language (HTML) specifies the structure and main content of a site -- it tells the browser how to lay out the content. Think of Markdown.

It is structured using tags.

<html>
    <head>
        (Meta) Information about the page.
    </head>
    <body>
        <p>
            This is a paragraph.
        </p>
        <table>
            This is a table
        </table>
    </body>
</html>

Tags determine the content and layout depending on their relation to other tags. Useful terminology:

  • child -- a child is a tag inside another tag. The p tag above is a child of the body tag.
  • parent -- a parent is the tag another tag is inside. The body tag above is a parent of the p tag.
  • sibling -- a sibling is a tag that is nested inside the same parent as another tag. The head and body tags above are siblings.

There are many different tags -- take a look at a reference list. You won't (and needn't) remember all of them, but it's useful to have a rough idea of what's available.

And take a look at a real example -- open a page, then right-click and choose "View Page Source".

In a real example you will see that tags often carry extra attributes, most commonly a class and an id. Something similar to the following:

<html>
    <head class='main-head'>
        (Meta) Information about the page.
    </head>
    <body>
        <p class='inner-paragraph' id='001'>
            This is a paragraph.
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <table class='inner-table' id='002'>
            This is a table
        </table>
    </body>
</html>

The class and id attributes will help us locate the information we are looking for in a systematic way. (They originally exist for CSS, which uses them to determine which HTML elements certain styles apply to.)

A useful way to explore the HTML and the corresponding website is to right-click on the web page and then click "Inspect Element" -- this shows the browser's interpretation of the HTML.

Suppose we want to check prices for renting a room in Manhattan on Craigslist. Let's look, for example, at the "rooms & shares" section for the East Village.

Accessing web pages

We have to download the content of the web page -- i.e. get the content structured by the HTML. We can do this with the requests library, a human-friendly HTTP (HyperText Transfer Protocol) library for Python. You can find the Quickstart documentation here.


In [2]:
import requests                              # you might have to install this

In [3]:
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'

cl = requests.get(url)
cl


Out[3]:
<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully.

A status_code of 200 means that the page downloaded successfully.

  • a status code starting with 2 generally indicates success.
  • a code starting with 4 or 5 indicates an error.
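
If you want your code to fail fast on a bad response, requests can raise the error for you. A minimal sketch, reusing the url defined above:

# raise requests.exceptions.HTTPError if the status code is 4xx or 5xx
resp = requests.get(url)
resp.raise_for_status()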

In [4]:
cl.status_code


Out[4]:
200

You might want to query for different things and download information for all of them.

  • You can pass the query as extra parameters (defined as a dictionary of keys and values).
  • The best way to learn about the available keys is to change things on the site and see how the URL changes.

In [5]:
url = 'https://newyork.craigslist.org/search/roo'
keys = {'query' : 'east village', 'availabilityMode' : '0'}
cl_extra = requests.get(url, params=keys)

In [6]:
# see if the URL was specified successfully
cl_extra.url


Out[6]:
'https://newyork.craigslist.org/search/roo?availabilityMode=0&query=east+village'

Use tab completion to explore the response object's other attributes.


In [7]:
cl.url


Out[7]:
'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'

To print out the content of the HTML file, use the content or text attributes.

This is going to be ugly and unreadable


In [8]:
cl.text[:300]


Out[8]:
'\ufeff<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist</title>\n\n    <meta name="description" content="new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist">\n    <meta http-equiv="X-U'

In [9]:
cl.content[:500] # raw bytes -- this also works for content that is not pure text


Out[9]:
b'\xef\xbb\xbf<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist</title>\n\n    <meta name="description" content="new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://newyork.craigslist.org/search/roo">\n    <link rel="alternate" type="application/rss+xml" href="https://newyork.craigslist.or'

Extracting information from a web page

Now that we have the content of the web page, we want to extract certain pieces of information. BeautifulSoup is a Python package that helps us do that. See the documentation for more information.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:


In [10]:
from bs4 import BeautifulSoup

In [11]:
BeautifulSoup?

In [12]:
cl_soup = BeautifulSoup(cl.content, 'html.parser')

Print this out in a prettier way.


In [13]:
#print(cl_soup.prettify())

In [14]:
print('Type:', type(cl_soup))


Type: <class 'bs4.BeautifulSoup'>

In [15]:
# we can access a tag 
print('Title: ', cl_soup.title)


Title:  <title>new york rooms for rent &amp; shares available "east village" - craigslist</title>

In [16]:
# or only the text content
print('Title: ', cl_soup.title.text)          # or
print('Title: ', cl_soup.title.get_text())


Title:  new york rooms for rent & shares available "east village" - craigslist
Title:  new york rooms for rent & shares available "east village" - craigslist

We can find all tags of a certain type with the find_all method. This returns a list of the matching tags.


In [17]:
cl_soup.find_all?

To get the first paragraph in the HTML, write:


In [18]:
cl_soup.find_all('p')[0]


Out[18]:
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2017-04-04 10:34" title="Tue 04 Apr 10:34:18 AM">Apr  4</time>
<a class="result-title hdrlnk" data-id="6073647116" href="/mnh/roo/6073647116.html">$1550 1 BR available in 2BR/1Ba Apt in East Village</a>
<span class="result-meta">
<span class="result-price">$1550</span>
<span class="result-hood"> (East Village)</span>
<span class="result-tags">
                    pic
                    <span class="maptag" data-pid="6073647116">map</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
<a class="restore-link" href="#">
<span class="restore-narrow-text">restore</span>
<span class="restore-wide-text">restore this posting</span>
</a>
</span>
</p>

This is a lot of information, and we only want part of it. Use the text attribute or the get_text() method to get the text content.


In [19]:
cl_soup.find_all('p')[0].get_text()


Out[19]:
'\n\nfavorite this post\n\nApr  4\n$1550 1 BR available in 2BR/1Ba Apt in East Village\n\n$1550\n (East Village)\n\n                    pic\n                    map\n\n\nhide this posting\n\n\n\nrestore\nrestore this posting\n\n\n'

This is still messy. We will need a smarter search.

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. For example, here are the children of the first paragraph tag.

Note: children returns a list iterator, so we need to call the list function on it.


In [20]:
list(cl_soup.find_all('p')[0].children)


Out[20]:
['\n', <span class="icon icon-star" role="button">
 <span class="screen-reader-text">favorite this post</span>
 </span>, '\n', <time class="result-date" datetime="2017-04-04 10:34" title="Tue 04 Apr 10:34:18 AM">Apr  4</time>, '\n', <a class="result-title hdrlnk" data-id="6073647116" href="/mnh/roo/6073647116.html">$1550 1 BR available in 2BR/1Ba Apt in East Village</a>, '\n', <span class="result-meta">
 <span class="result-price">$1550</span>
 <span class="result-hood"> (East Village)</span>
 <span class="result-tags">
                     pic
                     <span class="maptag" data-pid="6073647116">map</span>
 </span>
 <span class="banish icon icon-trash" role="button">
 <span class="screen-reader-text">hide this posting</span>
 </span>
 <span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
 <a class="restore-link" href="#">
 <span class="restore-narrow-text">restore</span>
 <span class="restore-wide-text">restore this posting</span>
 </a>
 </span>, '\n']
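
The '\n' entries are NavigableString objects -- the whitespace between tags. If we only care about the tag children, we can filter on the type. A quick sketch:

import bs4                                    # for the Tag type

first_p = cl_soup.find_all('p')[0]
tag_children = [child for child in first_p.children
                if isinstance(child, bs4.element.Tag)]
[tag.name for tag in tag_children]            # ['span', 'time', 'a', 'span']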

We can also look for tags based on their class. This is extremely useful for locating information efficiently.


In [21]:
cl_soup.find_all('span', class_='result-price')[0].get_text()


Out[21]:
'$1550'

In [22]:
cl_soup.find_all('span', class_='result-price')[:10]


Out[22]:
[<span class="result-price">$1550</span>,
 <span class="result-price">$1550</span>,
 <span class="result-price">$1595</span>,
 <span class="result-price">$900</span>,
 <span class="result-price">$900</span>,
 <span class="result-price">$1500</span>,
 <span class="result-price">$1500</span>,
 <span class="result-price">$1100</span>,
 <span class="result-price">$1100</span>,
 <span class="result-price">$1734</span>]
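
As an aside, BeautifulSoup also understands CSS selectors through the select method, so the class syntax from CSS carries over directly. A sketch of the equivalent search:

# tag.class selects span tags with the result-price class
cl_soup.select('span.result-price')[:3]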

In [23]:
prices = cl_soup.find_all('span', class_='result-price')

In [24]:
price_data = [price.get_text() for price in prices]

In [25]:
price_data[:10]


Out[25]:
['$1550',
 '$1550',
 '$1595',
 '$900',
 '$900',
 '$1500',
 '$1500',
 '$1100',
 '$1100',
 '$1734']

In [26]:
len(price_data)


Out[26]:
219

We are getting more prices than we want -- there were only 120 listings on the page. Check the ads with "Inspect Element": some listings contain the price twice, so there are duplicates. We need to work at a different tag level (<li>), one element per listing.
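
A quick sanity check of the duplication -- a sketch counting the price tags nested in each listing row:

# each result row is one listing, but it can contain the price tag twice
rows = cl_soup.find_all('li', class_='result-row')
price_counts = [len(row.find_all('span', class_='result-price')) for row in rows]
print(len(rows), sum(price_counts))    # should reproduce the 120 listings and 219 prices seen above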


In [27]:
cl_soup.find_all('li', class_='result-row')[0]


Out[27]:
<li class="result-row" data-pid="6073647116">
<a class="result-image gallery" data-ids="1:00o0o_fC2jCoSCLyf,1:00a0a_89spemQXEWi,1:00v0v_Vc4dgRYhMV,1:00B0B_lNkT7CR5Qw2,1:01616_dLwzQTRoU0a,1:00c0c_iEKKc8z205x" href="/mnh/roo/6073647116.html">
<span class="result-price">$1550</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2017-04-04 10:34" title="Tue 04 Apr 10:34:18 AM">Apr  4</time>
<a class="result-title hdrlnk" data-id="6073647116" href="/mnh/roo/6073647116.html">$1550 1 BR available in 2BR/1Ba Apt in East Village</a>
<span class="result-meta">
<span class="result-price">$1550</span>
<span class="result-hood"> (East Village)</span>
<span class="result-tags">
                    pic
                    <span class="maptag" data-pid="6073647116">map</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
<a class="restore-link" href="#">
<span class="restore-narrow-text">restore</span>
<span class="restore-wide-text">restore this posting</span>
</a>
</span>
</p>
</li>

In [28]:
ads = cl_soup.find_all('li', class_='result-row')

In [29]:
# we can access attribute values using a dictionary-like syntax
ads[5].find('a', class_='result-title hdrlnk')


Out[29]:
<a class="result-title hdrlnk" data-id="6072517459" href="/mnh/roo/6072517459.html">One bedroom in three bedroom East Village doorman/elevator apartment</a>

In [30]:
ads[5].find('a', class_='result-title hdrlnk')['href']


Out[30]:
'/mnh/roo/6072517459.html'

In [31]:
data = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads ]


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-5f66cb53d05d> in <module>()
      1 data = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
      2          ad.find('a', class_='result-title hdrlnk')['data-id'],
----> 3          ad.find('span', class_='result-price').get_text()] for ad in ads ]

<ipython-input-31-5f66cb53d05d> in <listcomp>(.0)
      1 data = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
      2          ad.find('a', class_='result-title hdrlnk')['data-id'],
----> 3          ad.find('span', class_='result-price').get_text()] for ad in ads ]

AttributeError: 'NoneType' object has no attribute 'get_text'

What's going wrong? Some ads don't have a price listed, so we can't retrieve it.


In [32]:
# if it exists then the type is
type(ads[0].find('span', class_='result-price'))


Out[32]:
bs4.element.Tag

If find does not locate a price, it returns None. We can exploit this fact to keep only the ads that have one.


In [33]:
import bs4

data = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads 
            if type(ad.find('span', class_='result-price'))==bs4.element.Tag]
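
A slightly more idiomatic version of the same filter simply tests against None -- a sketch:

# keep only the ads where a price tag was actually found
data = [[ad.find('a', class_='result-title hdrlnk').get_text(),
         ad.find('a', class_='result-title hdrlnk')['data-id'],
         ad.find('span', class_='result-price').get_text()] for ad in ads
            if ad.find('span', class_='result-price') is not None]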

In [34]:
data[:10]


Out[34]:
[['$1550 1 BR available in 2BR/1Ba Apt in East Village',
  '6073647116',
  '$1550'],
 ['1 Bedroom - East Village Bldg - Doorman', '6065914838', '$1595'],
 ['Cozy room in huge apartment in the heart of the east village',
  '6072717871',
  '$900'],
 ['East Village Furnished Room Sublet in a Sunny Apt. Amazing Location.',
  '6072672057',
  '$1500'],
 ['Great East Village Location, Facing Garden!', '6050790054', '$1100'],
 ['One bedroom in three bedroom East Village doorman/elevator apartment',
  '6072517459',
  '$1734'],
 ['Room In East Village (Modern Building)', '6072443769', '$1750'],
 ['Furnished bedroom in East Village-----May 1st', '6068586529', '$1700'],
 ['Beautiful sunny bedroom in great East Village/Lower East Side apt!',
  '6066076874',
  '$1300'],
 ['1BR Furnished in Furnished 3BR/1BA, Heart of the East Village',
  '6071631455',
  '$1650']]

In [35]:
df = pd.DataFrame(data)

In [36]:
df.head(10)


Out[36]:
0 1 2
0 $1550 1 BR available in 2BR/1Ba Apt in East Vi... 6073647116 $1550
1 1 Bedroom - East Village Bldg - Doorman 6065914838 $1595
2 Cozy room in huge apartment in the heart of th... 6072717871 $900
3 East Village Furnished Room Sublet in a Sunny ... 6072672057 $1500
4 Great East Village Location, Facing Garden! 6050790054 $1100
5 One bedroom in three bedroom East Village door... 6072517459 $1734
6 Room In East Village (Modern Building) 6072443769 $1750
7 Furnished bedroom in East Village-----May 1st 6068586529 $1700
8 Beautiful sunny bedroom in great East Village/... 6066076874 $1300
9 1BR Furnished in Furnished 3BR/1BA, Heart of t... 6071631455 $1650

In [37]:
df.shape


Out[37]:
(118, 3)

We only have 118 rows because 2 listings did not have a price.


In [38]:
df.columns = ['Title', 'ID', 'Price']

In [39]:
df.head()


Out[39]:
Title ID Price
0 $1550 1 BR available in 2BR/1Ba Apt in East Vi... 6073647116 $1550
1 1 Bedroom - East Village Bldg - Doorman 6065914838 $1595
2 Cozy room in huge apartment in the heart of th... 6072717871 $900
3 East Village Furnished Room Sublet in a Sunny ... 6072672057 $1500
4 Great East Village Location, Facing Garden! 6050790054 $1100

We could do text analysis and see which words are common in ads with relatively higher prices, as in the sketch below.
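
A rough sketch of such an analysis, assuming the df built above (prices are cleaned by stripping the leading '$'):

# convert prices to numbers and count title words in the above-median ads
df['price_num'] = df['Price'].str.lstrip('$').astype(float)
pricey = df[df['price_num'] > df['price_num'].median()]
word_counts = pd.Series(' '.join(pricey['Title']).lower().split()).value_counts()
word_counts.head(10)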

So far we have only scraped the first page of the search results. The top of the Craigslist page shows the total number of listings. In inspection mode we can pick an element on the page and check how it is defined in the HTML -- this is an efficient way to find the relevant tags and classes.

For example, the total number of ads is a span tag with a 'totalcount' class.


In [40]:
cl_soup.find('span', class_='totalcount')


Out[40]:
<span class="totalcount">488</span>

If we start clicking through the 2nd and 3rd pages of the results, we can see a pattern in how their URLs are defined.

First page:

https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0

Second page:

https://newyork.craigslist.org/search/roo?s=120&availabilityMode=0&query=east%20village

Third page:

https://newyork.craigslist.org/search/roo?s=240&availabilityMode=0&query=east%20village

The number after roo?s= in the query string is an offset: the page shows 120 listings starting after that index. In fact, if we modify it ourselves, we can make the page start from any listing we like. Try it!

We can also address the first page explicitly by putting s=0& after roo?, like this:

https://newyork.craigslist.org/search/roo?s=0&availabilityMode=0&query=east%20village
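
Rather than splicing the offset into the URL by hand, we can again let requests build it from a params dictionary. A sketch for the second page (page_keys is just an illustrative name):

# equivalent request for the second page of results
page_keys = {'s': 120, 'availabilityMode': 0, 'query': 'east village'}
page2 = requests.get('https://newyork.craigslist.org/search/roo', params=page_keys)
page2.url                                     # the offset shows up as s=120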


In [41]:
# First we get the total number of listings in real time
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'
cl = requests.get(url)

cl_soup = BeautifulSoup(cl.content, 'html.parser')
total_count = int(cl_soup.find('span', class_='totalcount').get_text())
print(total_count)


488

We have the total number of listings with the given search specification. Breaking down the steps:

1) Specify the URL of each page we want to scrape.

2) Scrape the data from each page -- we will reuse the code we already have for one page.

3) Collect the data in one DataFrame -- we can use the append method for DataFrames or the extend method for lists.


In [42]:
# 1) Specify the url
for page in range(0, total_count, 120):
    print('https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page))


https://newyork.craigslist.org/search/roo?s=0&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=120&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=240&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=360&availabilityMode=0&query=east%20village
https://newyork.craigslist.org/search/roo?s=480&availabilityMode=0&query=east%20village

In [43]:
# Next we write a loop to scrape all pages

df = pd.DataFrame({'Title' : [], 'ID' : [], 'Price' : []})

for page in range(0, total_count, 120):
    url = 'https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page)
    
    cl = requests.get(url)
    cl_soup = BeautifulSoup(cl.content, 'html.parser')
    
    ads = cl_soup.find_all('li', class_='result-row')
    data = pd.DataFrame([[ad.find('a', class_='result-title hdrlnk').get_text(), 
                          ad.find('a', class_='result-title hdrlnk')['data-id'], 
                          ad.find('span', class_='result-price').get_text()] for ad in ads 
                                 if type(ad.find('span', class_='result-price'))==bs4.element.Tag], 
                        columns=['Title', 'ID', 'Price'])
    
    df = df.append(data, ignore_index=True)

In [44]:
df.head()


Out[44]:
ID Price Title
0 6073647116 $1550 $1550 1 BR available in 2BR/1Ba Apt in East Vi...
1 6065914838 $1595 1 Bedroom - East Village Bldg - Doorman
2 6072717871 $900 Cozy room in huge apartment in the heart of th...
3 6072672057 $1500 East Village Furnished Room Sublet in a Sunny ...
4 6050790054 $1100 Great East Village Location, Facing Garden!

In [45]:
# Do the same by collecting rows in a list with `extend` --
# building the DataFrame once at the end avoids repeatedly copying it with `append`

data = []
for page in range(0, total_count, 120):
    url = 'https://newyork.craigslist.org/search/roo?s={}&availabilityMode=0&query=east%20village'.format(page)
    cl = requests.get(url)
    cl_soup = BeautifulSoup(cl.content, 'html.parser')
    ads = cl_soup.find_all('li', class_='result-row')
    data_page = [[ad.find('a', class_='result-title hdrlnk').get_text(), 
         ad.find('a', class_='result-title hdrlnk')['data-id'], 
         ad.find('span', class_='result-price').get_text()] for ad in ads 
        if type(ad.find('span', class_='result-price'))==bs4.element.Tag]
    data.extend(data_page)
    
df = pd.DataFrame(data, columns=['Title', 'ID', 'Price'])

In [46]:
df.head()


Out[46]:
Title ID Price
0 $1550 1 BR available in 2BR/1Ba Apt in East Vi... 6073647116 $1550
1 1 Bedroom - East Village Bldg - Doorman 6065914838 $1595
2 Cozy room in huge apartment in the heart of th... 6072717871 $900
3 East Village Furnished Room Sublet in a Sunny ... 6072672057 $1500
4 Great East Village Location, Facing Garden! 6050790054 $1100

In [47]:
df.shape


Out[47]:
(464, 3)

In [48]:
df.tail()


Out[48]:
Title ID Price
459 Spacious NYC Apartments, Huge Rooms, Utilities... 6064938758 $1400
460 Spacious NYC Apartments, Huge Rooms, Utilities... 6064571433 $1400
461 $1000/120ft² - Serene Share in StuyTown - Perf... 6064283992 $1000
462 Spacious NYC Apartments, Huge Rooms, Utilities... 6064231299 $1400
463 LARGE & SUNNY ROOM IN STUYTOWN - COME AND CHEC... 6063952782 $1650

We have now scraped all the Craigslist listings in the "rooms & shares" section for the East Village.

Exercise

Suppose you have a couple of destinations in mind and you want to check the weather for each of them for this Friday. You want to get it from the National Weather Service.

These are the places I want to check (suppose there are many more and you want to automate it):

locations = ['Bozeman, Montana', 'White Sands National Monument', 'Stanford University, California']

It seems that the NWS is using latitude and longitude coordinates in its search.

e.g. for White Sands: http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997

It would be convenient to pass these in as parameters.

After some Google fu (e.g. searching for "latitude and longitude of location python") we find a post by Chris Albon that describes exactly what we want:

"Geocoding (converting a phyiscal address or location into latitude/longitude) and reverse geocoding (converting a lat/long to a phyiscal address or location)[...] Python offers a number of packages to make the task incredibly easy [...] use pygeocoder, a wrapper for Google's geo-API, to both geocode and reverse geocode.

Install pygeocoder with pip install pygeocoder (through conda only the OS X version is available).


In [50]:
from pygeocoder import Geocoder

In [51]:
# check how it works for one of the locations
# some addresses might not be valid -- the lookup goes through Google's API

loc = Geocoder.geocode('Bozeman, Montana')
loc.coordinates


Out[51]:
(45.6769979, -111.0429339)

In [52]:
Geocoder.geocode('Stanford, California').coordinates


Out[52]:
(37.42410599999999, -122.1660756)

We can check whether it's working fine at http://www.latlong.net/


In [53]:
locations = ['Bozeman, Montana', 'White Sands National Monument', 'Stanford University, California']

coordinates = [Geocoder.geocode(location).coordinates for location in locations]

In [54]:
coordinates


Out[54]:
[(45.6769979, -111.0429339),
 (32.7872403, -106.3256816),
 (37.4274745, -122.169719)]

In [55]:
for location, coordinate in zip(locations, coordinates):
    print('The coordinates of {} are:'.format(location), coordinate)


The coordinates of Bozeman, Montana are: (45.6769979, -111.0429339)
The coordinates of White Sands National Monument are: (32.7872403, -106.3256816)
The coordinates of Stanford University, California are: (37.4274745, -122.169719)

Define a dictionary of the parameters we want to pass in the GET request to the NWS server.


In [56]:
keys = {}
for location, coordinate in zip(locations, coordinates):
    keys[location] = {'lat' : coordinate[0], 'lon' : coordinate[1]}

In [57]:
keys


Out[57]:
{'Bozeman, Montana': {'lat': 45.6769979, 'lon': -111.0429339},
 'Stanford University, California': {'lat': 37.4274745, 'lon': -122.169719},
 'White Sands National Monument': {'lat': 32.7872403, 'lon': -106.3256816}}

Recall the format of the URL associated with a particular location:

http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997


In [58]:
keys[locations[0]]


Out[58]:
{'lat': 45.6769979, 'lon': -111.0429339}

In [59]:
url = 'http://forecast.weather.gov/MapClick.php'    
nws = requests.get(url, params=keys[locations[0]])

In [60]:
nws.status_code


Out[60]:
200

In [61]:
nws.url


Out[61]:
'http://forecast.weather.gov/MapClick.php?lon=-111.0429339&lat=45.6769979'

In [62]:
nws.content[:300]


Out[62]:
b'<!DOCTYPE html>\n<html class="no-js">\n    <head>\n        <!-- Meta -->\n        <meta name="viewport" content="width=device-width">\n        <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" /><title>National Weather Service</title><meta name="DC.title" content="National Weather Service" />'

In [63]:
nws_soup = BeautifulSoup(nws.content, 'html.parser')

In [64]:
seven = nws_soup.find('div', id='seven-day-forecast-container')

In [65]:
seven.find(text='Friday')


Out[65]:
'Friday'

In [66]:
seven.find(text='Friday').parent


Out[66]:
<p class="period-name">Friday<br><br/></br></p>

In [67]:
seven.find(text='Friday').parent.parent


Out[67]:
<div class="tombstone-container">
<p class="period-name">Friday<br><br/></br></p>
<p><img alt="Friday: A 20 percent chance of showers.  Partly sunny, with a high near 62." class="forecast-icon" src="newimages/medium/shra20.png" title="Friday: A 20 percent chance of showers.  Partly sunny, with a high near 62."/></p><p class="short-desc">Slight Chance<br>Showers</br></p><p class="temp temp-high">High: 62 °F</p></div>

In [68]:
seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()


Out[68]:
'High: 62 °F'
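
Note that find(text='Friday') matches that exact string; if the page labels the period differently (or Friday is missing from the seven-day window), find returns None and the chained calls above would fail. A defensive sketch:

# extract Friday's high only if the label and the temperature are present
friday = seven.find(text='Friday')
if friday is not None:
    temp_tag = friday.parent.parent.find('p', class_='temp temp-high')
    if temp_tag is not None:
        print(temp_tag.get_text())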

In [69]:
data = []

for location in locations:
    
    nws = requests.get(url, params=keys[location])
    nws_soup = BeautifulSoup(nws.content, 'html.parser')
    
    seven = nws_soup.find('div', id='seven-day-forecast-container')
    temp = seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()
    data.append([location, temp])

In [70]:
df_weather = pd.DataFrame(data, columns=['Location', 'Friday weather'])

In [71]:
df_weather


Out[71]:
Location Friday weather
0 Bozeman, Montana High: 62 °F
1 White Sands National Monument High: 83 °F
2 Stanford University, California High: 66 °F

In [72]:
df_weather['high_temp'] = df_weather['Friday weather'].str.rsplit().str.get(1).astype(float)

In [73]:
df_weather['high_temp'].std()


Out[73]:
11.150485789118488
