Getting content from the Internet

First, we use the python-dotenv package to load the settings from the .env file.


In [6]:
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

# Get the base URL of the data
BASE_URL = os.environ.get("BASE_URL")

print("Source data URL is    : {0}".format(BASE_URL))


Source data URL is    : https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/
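
Note that os.environ.get returns None when a variable is missing, so a typo in the .env file would only surface later as a confusing request error. A minimal sketch of failing early instead is shown below; the helper name get_required_setting and its environ parameter are hypothetical and exist only for this illustration.

```python
import os


def get_required_setting(name, environ=os.environ):
    """Return the setting called `name`, or raise if it is missing or empty.

    `environ` defaults to the real environment; passing a plain dict
    makes the helper easy to try out without touching os.environ.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError("{0} is not set; add it to the .env file".format(name))
    return value


# An in-memory mapping standing in for the loaded environment
settings = {"BASE_URL": "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/"}
print("Source data URL is    : {0}".format(get_required_setting("BASE_URL", settings)))
```

In the notebook itself, the call would simply be get_required_setting("BASE_URL") after load_dotenv has run.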

The requests and BeautifulSoup packages

The requests package lets you fetch the content behind a URL.
BeautifulSoup then parses the returned HTML so that you can clean it up and search it.


In [7]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Send the request and store the response: r
r = requests.get(BASE_URL)

# Extract the response body as text: html_doc
html_doc = r.text
print ("================================================")
print ("This is what the raw result looks like:")
print ("================================================")
print (html_doc)
print ("================================================")


================================================
This is what the raw result looks like:
================================================
<html><head><title>ftp.cdc.gov - /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/</title></head><body><H1>ftp.cdc.gov - /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/</H1><hr>

<pre><A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/">[To Parent Directory]</A><br><br>    Tuesday, December 6, 2016  1:29 PM      6098348 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/cancerxx.zip">cancerxx.zip</A><br>     Wednesday, June 22, 2016 10:55 AM      2430305 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/familyxx.zip">familyxx.zip</A><br>     Wednesday, June 22, 2016 10:55 AM       787954 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/fmlydisb.zip">fmlydisb.zip</A><br>     Wednesday, June 22, 2016 10:55 AM       498083 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/funcdisb.zip">funcdisb.zip</A><br>     Wednesday, June 22, 2016 10:55 AM      1081686 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/househld.zip">househld.zip</A><br>     Wednesday, June 22, 2016 10:55 AM       129497 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/injpoiep.zip">injpoiep.zip</A><br>     Wednesday, June 22, 2016 10:55 AM      2903983 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/paradata.zip">paradata.zip</A><br>     Wednesday, June 22, 2016 10:55 AM     10107724 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/personsx.zip">personsx.zip</A><br>     Wednesday, June 22, 2016 10:55 AM      8052515 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samadult.zip">samadult.zip</A><br>     Wednesday, June 22, 2016 10:55 AM       906007 <A HREF="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samchild.zip">samchild.zip</A><br></pre><hr></body></html>
================================================
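
The request above assumes the server answers promptly and successfully. A slightly more defensive variant might look like the sketch below: it sets a timeout and raises on HTTP error codes. The function name fetch_html is hypothetical, and its get parameter is an assumption added only so the function can be tried offline with a stand-in.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Fetch `url` and return the response body as text.

    `timeout` is in seconds; `get` defaults to requests.get and can be
    swapped for a stand-in when no network is available.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

In the notebook, html_doc = fetch_html(BASE_URL) would take the place of the two lines that call requests.get and read r.text.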

In [8]:
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'lxml')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
print ("================================================")
print ("This is what the prettified content looks like:")
print ("================================================")
print (pretty_soup)
print ("================================================")


================================================
This is what the prettified content looks like:
================================================
<html>
 <head>
  <title>
   ftp.cdc.gov - /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/
  </title>
 </head>
 <body>
  <h1>
   ftp.cdc.gov - /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/
  </h1>
  <hr/>
  <pre><a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/">[To Parent Directory]</a><br/><br/>    Tuesday, December 6, 2016  1:29 PM      6098348 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/cancerxx.zip">cancerxx.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM      2430305 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/familyxx.zip">familyxx.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM       787954 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/fmlydisb.zip">fmlydisb.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM       498083 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/funcdisb.zip">funcdisb.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM      1081686 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/househld.zip">househld.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM       129497 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/injpoiep.zip">injpoiep.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM      2903983 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/paradata.zip">paradata.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM     10107724 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/personsx.zip">personsx.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM      8052515 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samadult.zip">samadult.zip</a><br/>     Wednesday, June 22, 2016 10:55 AM       906007 <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samchild.zip">samchild.zip</a><br/></pre>
  <hr/>
 </body>
</html>
================================================

Replacing content

In this case, the <pre> tag prevents the content from being nicely formatted: whitespace inside <pre> is significant, so prettify() leaves everything within it on a single line. So why not remove the tag and replace it with its child elements?


In [9]:
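soup = BeautifulSoup(html_doc, 'lxml')

# unwrap() removes the <pre> tag but keeps its children in place
# (replace_with_children() is an older alias for the same method)
soup.pre.unwrap()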
print ("================================================")
print ("The impact of removing the <pre></pre> tag:")
print ("================================================")
print(soup.prettify())
print ("================================================")


================================================
The impact of removing the <pre></pre> tag:
================================================
<html>
 <head>
  <title>
   ftp.cdc.gov - /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/
  </title>
 </head>
 <body>
  <h1>
   ftp.cdc.gov - /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/
  </h1>
  <hr/>
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/">
   [To Parent Directory]
  </a>
  <br/>
  <br/>
  Tuesday, December 6, 2016  1:29 PM      6098348
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/cancerxx.zip">
   cancerxx.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM      2430305
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/familyxx.zip">
   familyxx.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM       787954
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/fmlydisb.zip">
   fmlydisb.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM       498083
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/funcdisb.zip">
   funcdisb.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM      1081686
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/househld.zip">
   househld.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM       129497
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/injpoiep.zip">
   injpoiep.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM      2903983
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/paradata.zip">
   paradata.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM     10107724
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/personsx.zip">
   personsx.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM      8052515
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samadult.zip">
   samadult.zip
  </a>
  <br/>
  Wednesday, June 22, 2016 10:55 AM       906007
  <a href="/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samchild.zip">
   samchild.zip
  </a>
  <br/>
  <hr/>
 </body>
</html>
================================================
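
To see the effect in isolation, the same step can be tried on a small hand-written snippet. This is only a sketch: the snippet and the name small_soup are made up for illustration, and it uses the built-in 'html.parser' rather than 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the directory listing, with two links inside <pre>
snippet = (
    '<pre><a href="/data/a.zip">a.zip</a><br>'
    '<a href="/data/b.zip">b.zip</a><br></pre>'
)

small_soup = BeautifulSoup(snippet, 'html.parser')

# Remove the <pre> wrapper; its children stay where they were
small_soup.pre.unwrap()

# The <pre> tag is gone, so prettify() can now put each element on its own line
print(small_soup.prettify())
```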

Finding specific tags

Another neat feature of BeautifulSoup is the ability to locate tags by name.


In [10]:
# Find all the links (<a> tags)
links = soup.find_all('a')

# Use link.get('href') to get the contents of the href attribute;
# format() also copes with a link that has no href (get returns None)
for link in links:
    print('URL: {0}.'.format(link.get('href')))


URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/cancerxx.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/familyxx.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/fmlydisb.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/funcdisb.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/househld.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/injpoiep.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/paradata.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/personsx.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samadult.zip.
URL: /pub/Health_Statistics/NCHS/Datasets/NHIS/2015/samchild.zip.
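
The href values above are relative to the server root, and the first one points at the parent directory rather than a data file. One way to turn them into downloadable addresses is sketched below, assuming only the .zip entries are of interest. The hrefs list is copied by hand from the first few lines of output, and zip_urls is a name introduced for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/"

# The first few href values printed above
hrefs = [
    "/pub/Health_Statistics/NCHS/Datasets/NHIS/",
    "/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/cancerxx.zip",
    "/pub/Health_Statistics/NCHS/Datasets/NHIS/2015/familyxx.zip",
]

# Keep only the zip files and resolve each path against the base URL
zip_urls = [urljoin(BASE_URL, href) for href in hrefs if href.lower().endswith(".zip")]

for url in zip_urls:
    print(url)
```

In the notebook, the list could instead be built from the links found by find_all, using link.get('href') for each one and skipping any that return None.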