Title: Beautiful Soup Basic HTML Scraping
Slug: beautiful_soup_html_basics
Summary: Beautiful Soup Basic HTML Scraping
Date: 2016-05-01 12:00
Category: Python
Tags: Web Scraping
Authors: Chris Albon

Import the modules



In [79]:

    
# Import required modules
import requests
from bs4 import BeautifulSoup

Scrape the HTML and turn it into a Beautiful Soup object



In [80]:

    
# Create a variable with the url
url = 'http://chrisralbon.com'

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Parse the HTML content into a Beautiful Soup object (the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(html_content, 'lxml')
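The request above does not set a timeout or check for HTTP errors. A minimal sketch of a more defensive version follows. The helper names `fetch_html` and `make_soup` are hypothetical, and the built-in `html.parser` is swapped in for `lxml` so the example has no extra dependency.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


def make_soup(html_content, parser='html.parser'):
    """Parse HTML text into a BeautifulSoup object (hypothetical helper)."""
    return BeautifulSoup(html_content, parser)


# Parsing works the same on any HTML string, so it can be tried without a network call
sample_soup = make_soup('<html><head><title>Example</title></head><body></body></html>')
```

With a live site, `make_soup(fetch_html(url))` would reproduce the two steps above in one call.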

Select the website's title



In [81]:

    
# View the title tag of the soup object
soup.title

    Out[81]:

<title>Chris Albon</title>

The string inside the title tag



In [82]:

    
# View the string within the title tag
soup.title.string

    Out[82]:

'Chris Albon'
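Not every page has a title: `soup.title` is `None` when the tag is missing, so `soup.title.string` would raise an `AttributeError`. One possible guard is sketched below. The function name `title_text` and the sample HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def title_text(soup, default=''):
    """Return the page title's text, or a default when there is no usable <title>."""
    # soup.title is None when the document has no <title> tag,
    # and .string is None when the tag is empty
    if soup.title is None or soup.title.string is None:
        return default
    return soup.title.string.strip()


with_title = BeautifulSoup('<html><head><title> Chris Albon </title></head></html>', 'html.parser')
without_title = BeautifulSoup('<html><head></head><body></body></html>', 'html.parser')
```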

First paragraph tag



In [83]:

    
# View the first paragraph tag of the soup object
soup.p

    Out[83]:

<p>I am a <a href="./pages/about.html">data scientist originally trained as a quantitative political scientist</a>. I specialize in the technical and organizational aspects of applying data science to political and social issues. </p>
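Accessing a tag name as an attribute, as in `soup.p`, is shorthand for `find()` and returns only the first match. A small sketch on a made-up HTML snippet (the snippet and variable names are assumptions):

```python
from bs4 import BeautifulSoup

html = '<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>'
sample = BeautifulSoup(html, 'html.parser')

# Attribute access and find() both return the first matching tag
first_by_attribute = sample.p
first_by_find = sample.find('p')

# find() returns None when nothing matches, so check before using the result
missing = sample.find('table')
```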

The parent of the title tag



In [84]:

    
# View the name of the tag that directly contains the title tag
soup.title.parent.name

    Out[84]:

'head'
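`.parent` gives only the immediate container. To walk further up the tree, `.parents` iterates over every ancestor. A brief sketch, assuming a small hand-written document:

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Chris Albon</title></head><body></body></html>'
sample = BeautifulSoup(html, 'html.parser')

# .parent is the immediate container; .parents iterates over every ancestor in turn
immediate_parent = sample.title.parent.name
ancestor_names = [ancestor.name for ancestor in sample.title.parents]
```

The last entry in `ancestor_names` is the `BeautifulSoup` object itself, which sits above the `html` tag.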

The first link tag



In [85]:

    
# View the first link (anchor) tag of the soup object
soup.a

    Out[85]:

<a class="navbar-brand" href=".">Chris Albon</a>
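Once a tag is selected, its attributes can be read much like dictionary keys. The sketch below reuses the link markup from the output above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<a class="navbar-brand" href=".">Chris Albon</a>'
sample = BeautifulSoup(html, 'html.parser')
link = sample.a

# Index a tag like a dictionary to read an attribute; this raises KeyError if it is absent
href = link['href']

# .get() returns None (or a supplied default) for a missing attribute
target = link.get('target')

# class can hold several values, so Beautiful Soup returns it as a list
classes = link.get('class')
```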

Find all the link tags and list the first five



In [86]:

    
# Find every link tag, then slice the first five results
soup.find_all('a')[0:5]

    Out[86]:

[<a class="navbar-brand" href=".">Chris Albon</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="#" role="button">About<span class="caret"></span></a>,
 <a href="./pages/about.html">About Chris</a>,
 <a href="https://github.com/chrisalbon">GitHub</a>,
 <a href="https://twitter.com/chrisalbon">Twitter</a>]
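Often the goal is the URLs rather than the tags themselves. A possible approach is to collect each `href` and resolve relative paths with `urllib.parse.urljoin`. The base URL `http://example.com/` and the third sample link are assumptions made for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, used only to resolve relative links in this example
base_url = 'http://example.com/'

html = '''
<a href="./pages/about.html">About Chris</a>
<a href="https://github.com/chrisalbon">GitHub</a>
<a name="top">Anchor without href</a>
'''
sample = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute
links = sample.find_all('a', href=True)

# urljoin resolves relative paths against the base and leaves absolute URLs unchanged
absolute_urls = [urljoin(base_url, link['href']) for link in links]
```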

The string inside the first paragraph tag



In [87]:

    
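# Returns None here: .string is only set when a tag has a single child,
# and this paragraph contains both text and a link
soup.p.string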
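The cell above shows no output because `.string` is `None` for a tag with more than one child. When the full text is wanted, `get_text()` is one alternative. A short sketch on a shortened, made-up version of the paragraph:

```python
from bs4 import BeautifulSoup

html = '<p>I am a <a href="./pages/about.html">data scientist</a>. I work with data.</p>'
sample = BeautifulSoup(html, 'html.parser')

# The paragraph has three children (text, a link, more text), so .string is None
paragraph_string = sample.p.string

# get_text() joins the text of all descendants into one string
paragraph_text = sample.p.get_text()
```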

Find all the h2 tags and list the first five



In [88]:

    
# Find every h2 tag, then slice the first five results
soup.find_all('h2')[0:5]

    Out[88]:

[<h2 class="homepage_category_title">Articles</h2>,
 <h2 class="homepage_category_title">Projects</h2>,
 <h2 class="homepage_category_title">Python</h2>,
 <h2 class="homepage_category_title">R Stats</h2>,
 <h2 class="homepage_category_title">Regex</h2>]
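`find_all()` can also filter by CSS class, which helps when a page mixes several kinds of headings. A sketch reusing the `homepage_category_title` class from the output above; the `sidebar_title` heading is a hypothetical addition to show the filter at work.

```python
from bs4 import BeautifulSoup

html = '''
<h2 class="homepage_category_title">Articles</h2>
<h2 class="homepage_category_title">Projects</h2>
<h2 class="sidebar_title">Elsewhere</h2>
'''
sample = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) filters on the CSS class attribute
category_headings = sample.find_all('h2', class_='homepage_category_title')

# get_text(strip=True) returns each heading's text without surrounding whitespace
category_names = [heading.get_text(strip=True) for heading in category_headings]
```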
