Title: Beautiful Soup Basic HTML Scraping
Slug: beautiful_soup_html_basics
Summary: Beautiful Soup Basic HTML Scraping
Date: 2016-05-01 12:00
Category: Python
Tags: Web Scraping
Authors: Chris Albon

Import the modules


In [79]:
# Import required modules
import requests
from bs4 import BeautifulSoup

Scrape the HTML and turn it into a Beautiful Soup object


In [80]:
# Create a variable with the url
url = 'http://chrisralbon.com'

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Parse the HTML content into a Beautiful Soup object using the lxml parser
soup = BeautifulSoup(html_content, 'lxml')
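
The fetch-and-parse step above can be wrapped in small helpers. This is a minimal sketch, not the author's code. The function names `make_soup` and `fetch_soup`, the 10-second timeout, and the choice of Python's built-in `'html.parser'` (instead of `'lxml'`, which needs a separate install) are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html_content, parser='html.parser'):
    """Parse an HTML string into a BeautifulSoup object (hypothetical helper)."""
    return BeautifulSoup(html_content, parser)


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    return make_soup(r.text)


# Offline demonstration with a small, made-up HTML string
sample_soup = make_soup('<html><head><title>Example</title></head><body></body></html>')
print(sample_soup.title.string)
```

Calling `fetch_soup(url)` with the `url` variable defined above should give a soup object comparable to the one used in the rest of this post, although different parsers can build slightly different trees from malformed HTML.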

Select the website's title


In [81]:
# View the title tag of the soup object
soup.title


Out[81]:
<title>Chris Albon</title>

Website title tag's string


In [82]:
# View the string within the title tag
soup.title.string


Out[82]:
'Chris Albon'

First paragraph tag


In [83]:
# View the first paragraph tag of the soup
soup.p


Out[83]:
<p>I am a <a href="./pages/about.html">data scientist originally trained as a quantitative political scientist</a>. I specialize in the technical and organizational aspects of applying data science to political and social issues. </p>

The parent of the title tag


In [84]:
# View the name of the tag that contains the title tag
soup.title.parent.name


Out[84]:
'head'

The first anchor (link) tag


In [85]:
# View the first anchor tag of the soup
soup.a


Out[85]:
<a class="navbar-brand" href=".">Chris Albon</a>

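Find all the anchor tags and list the first five


In [86]:
# Find every anchor tag, then slice the first five results
soup.find_all('a')[0:5]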


Out[86]:
[<a class="navbar-brand" href=".">Chris Albon</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="#" role="button">About<span class="caret"></span></a>,
 <a href="./pages/about.html">About Chris</a>,
 <a href="https://github.com/chrisalbon">GitHub</a>,
 <a href="https://twitter.com/chrisalbon">Twitter</a>]
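
Often the goal is the link targets rather than the tags themselves. The sketch below shows one way to pull the `href` attribute out of each anchor and resolve relative links. The HTML string imitates the anchors listed above, and `base_url` is a placeholder chosen for illustration, not the real site address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML that imitates the anchors shown above
html_content = '''
<a class="navbar-brand" href=".">Chris Albon</a>
<a href="./pages/about.html">About Chris</a>
<a href="https://github.com/chrisalbon">GitHub</a>
<a name="no-href">Anchor without a link target</a>
'''
soup = BeautifulSoup(html_content, 'html.parser')

# Placeholder base URL, used only to resolve relative links
base_url = 'http://example.com/'

# href=True keeps only the anchors that actually have an href attribute
links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
print(links)
```

Filtering with `href=True` avoids a `KeyError` on anchors that carry no `href`.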

The string inside the first paragraph tag


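In [87]:
# View the string within the first paragraph tag. This returns None
# (so no output is shown) because the tag has more than one child.
soup.p.string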
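
To see why `.string` comes back empty for a paragraph like this, and how to get the text anyway, here is a small sketch. The HTML string is made up to have the same shape as the paragraph shown earlier (text, a nested link, more text).

```python
from bs4 import BeautifulSoup

# Made-up paragraph: text, a nested <a> tag, then more text
html_content = '<p>I am a <a href="./pages/about.html">data scientist</a>. I like soup.</p>'
soup = BeautifulSoup(html_content, 'html.parser')

# .string is None because <p> has three children, not a single string
print(soup.p.string)

# .get_text() joins the text of every descendant into one string
paragraph_text = soup.p.get_text()
print(paragraph_text)

# .string does work on the nested <a>, which has exactly one text child
print(soup.p.a.string)
```

When a tag may contain nested markup, `.get_text()` is usually the safer choice.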

Find all the h2 tags and list the first five


In [88]:
soup.find_all('h2')[0:5]


Out[88]:
[<h2 class="homepage_category_title">Articles</h2>,
 <h2 class="homepage_category_title">Projects</h2>,
 <h2 class="homepage_category_title">Python</h2>,
 <h2 class="homepage_category_title">R Stats</h2>,
 <h2 class="homepage_category_title">Regex</h2>]
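
`find_all` can also filter by CSS class, which helps when only some headings are wanted. The following sketch uses a made-up HTML string that reuses the `homepage_category_title` class from the output above. The `other` class and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two category headings and one unrelated heading
html_content = '''
<h2 class="homepage_category_title">Articles</h2>
<h2 class="other">Not a category</h2>
<h2 class="homepage_category_title">Projects</h2>
'''
soup = BeautifulSoup(html_content, 'html.parser')

# class_ (with a trailing underscore) filters on the class attribute
category_tags = soup.find_all('h2', class_='homepage_category_title')

# Keep just the text of each matching heading
category_names = [tag.get_text(strip=True) for tag in category_tags]
print(category_names)
```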
