Scraping HTML pages

This notebook provides an introduction to web (data) scraping using the requests, BeautifulSoup and pandas libraries.

Descriptions

  • requests - provides a get() function that fetches the HTML source of a given page. The status_code attribute tells us whether the request was successfully answered (e.g. 200) or not (e.g. 404). In case of success, one may use the content attribute to get only the HTML content (without status code etc.) for further scraping.
  • pandas - the most popular Python package for data analysis and manipulation. Provides many convenient functions for reading different file formats, including a read_html() function for reading HTML tables. Attention!: it reads only tables and returns the output as a list of dataframes. Nothing other than the text content of the tags will be saved.
  • BeautifulSoup - provides a lot of convenient functions for scraping data in Python, which are only applicable to BeautifulSoup elements. Thus, the BeautifulSoup() function should first be used to convert the page into a BeautifulSoup object; only then can the functions below be applied (a short sketch after this list ties all three libraries together).

      - find_all() - when used on a BeautifulSoup element, finds all matching tags. A class or ID can be specified as well to make the search more precise. This function returns a list of BeautifulSoup elements.
      - find() - works like find_all(), but returns only the very first element matching the criteria. It is especially useful when one is sure that only one element satisfies the criteria. Returns a BeautifulSoup element directly.
      - get_text() - returns only the text content of a BeautifulSoup element (i.e. without tags).
      - get() - returns the value of the given attribute, e.g. get("href") returns the hyperlink stored in an <a> tag.
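
A minimal sketch tying these pieces together (the URL and the class name "tag" are borrowed from Case 1 below):

In [ ]:
#minimal sketch of the workflow described above
import requests
from bs4 import BeautifulSoup

response = requests.get("http://quotes.toscrape.com/")
print(response.status_code)                           #200 if the request succeeded
soup = BeautifulSoup(response.content, "html.parser") #convert to BeautifulSoup object
first_tag = soup.find("a", class_="tag")              #first <a> tag with class "tag"
print(first_tag.get_text())                           #text content only, without tags
print(first_tag.get("href"))                          #value of the href attribute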

This notebook covers the following cases:

  1. Quotes to Scrape,
  2. Books to Scrape,
  3. Unicorn companies, CBInsights,
  4. Careercenter.

In [2]:
import numpy as np #for numeric operations
import pandas as pd #for dealing with dataframes
import matplotlib.pyplot as plt #for visualization

import requests #get html
from bs4 import BeautifulSoup #for scraping
from pprint import pprint #for pretty printing

Case 1: Quotes to Scrape


In [3]:
#robots.txt checked
url = "http://quotes.toscrape.com/"

In [6]:
#get html
response = requests.get(url)
response.status_code #200 then good, 404 then bad


Out[6]:
200

In [22]:
page = response.content #html to scrape
page = BeautifulSoup(page, "html.parser") #change type
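
A side note on the second argument of BeautifulSoup(): "html.parser" is Python's built-in parser and needs no extra installation. If the lxml package is installed, it can be passed instead as a usually faster drop-in replacement:

In [ ]:
#alternative parser, assuming lxml is installed
page = BeautifulSoup(response.content, "lxml")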

In [39]:
#find all hashtags, extract text from the first
hashtags_html = page.find_all("a",class_ = "tag") #tag finder
hashtags_html[0].get_text() #get text from the whole tag


Out[39]:
'change'

In [43]:
#extract text from all hashtags and save it
#Approach 1: for loop
hashtags = []
for i in hashtags_html:
    hashtags.append(i.get_text())
print(hashtags)


['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']

In [45]:
#Approach 2: get texts using a list comprehension
hashtags = [i.get_text() for i in hashtags_html] 
print(hashtags)


['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']

In [52]:
#from now on, use list comprehensions for simplicity
#get links
links = [i.get("href") for i in hashtags_html]
pprint(links)


['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']
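
These links are relative; to request one of them, it must first be joined with the base URL. A small sketch using urljoin from the standard library:

In [ ]:
#turn the relative links into absolute URLs
from urllib.parse import urljoin
full_links = [urljoin(url, i) for i in links]
print(full_links[0]) #http://quotes.toscrape.com/tag/change/page/1/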

In [63]:
#get hashtags from links, treating each link as a string:
#e.g. "/tag/change/page/1/".split("/") gives ['', 'tag', 'change', 'page', '1', ''],
#so element 2 is the tag name
hashtags_from_links = [i.split("/")[2] for i in links]

In [66]:
authors_html = page.find_all("small",class_ = "author")
authors = [i.get_text() for i in authors_html]
print(authors)


['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']

Case 2: Books to Scrape


In [ ]:
#get the HTML page and make it BeautifulSoup type
url = "http://books.toscrape.com/"
response  = requests.get(url)
page = response.content
page = BeautifulSoup(page,"html.parser")

In [ ]:
#find tags with prices, get text only, convert to numeric format
prices_html = page.find_all("p",class_="price_color")
prices = [i.get_text() for i in prices_html]
prices_numeric = [float(i.replace("£","")) for i in prices]
print(prices_numeric)

In [ ]:
#convert prices back to text and save in a txt file (each on new line)
with open("prices.txt","w") as f:
    for i in prices_numeric:
        f.write(str(i)+"\n")
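
As a quick sanity check, the file can be read back line by line:

In [ ]:
#read the prices back, one float per line
with open("prices.txt") as f:
    prices_back = [float(line) for line in f]
print(prices_back[:5])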

Now we will try to get the book titles. The difficulty of the task is that the title links have no class or id to search by. We will try 3 approaches: the first will not work, the other 2 will.


In [ ]:
#trial 1 - parent search (did not work: get_text() on the whole article
#returns all of its text, e.g. price and availability, not just the title)
article_tag = page.find_all("article",class_="product_pod")
titles_wrong = [i.get_text() for i in article_tag]

In [ ]:
#trial 2 - nested search
#find all h3 tags, then the first (and only) a tag inside each
h3s = page.find_all("h3")
titles = [i.find("a").get_text() for i in h3s]

In [ ]:
#trial 3 - filter by content of <a>
#find all a tags, then keep only those whose string
#representation contains "title=" and get their text
a_tags = page.find_all("a")
titles = []
for i in a_tags:
    if str(i).find("title=")>-1:
        text = i.get_text()
        titles.append(text)
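
A note on trial 3: instead of searching the string representation for "title=", BeautifulSoup can filter on attributes directly. This also helps with a subtlety of this site: the link text of long titles may be truncated, while the title attribute holds the full title. A sketch of both points:

In [ ]:
#same idea with BeautifulSoup's attribute filter:
#title=True keeps only <a> tags that have a title attribute
a_with_title = page.find_all("a", title=True)
titles = [i.get("title") for i in a_with_title] #full titles from the attribute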

Case 3: Unicorn companies, CBInsights


In [ ]:
url = "https://www.cbinsights.com/research-unicorn-companies"
#as the data is provided inside an HTML table, we can use pandas to read it
data = pd.read_html(url)
#result is a list with one element, which turns out to be a dataframe
type(data[0])

In [ ]:
startups = data[0]
startups.head()

Case 4: Careercenter


In [ ]:
url_career = "https://careercenter.am/ccidxann.php"
tables_career = pd.read_html(url_career)
len(tables_career)

In [ ]:
#it turned out there were 5 tables on this website; we are interested in the first one
tables_career[0].head()

In [ ]:
tables_career[0].to_csv("jobs.csv")
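
Note that by default to_csv() also writes the dataframe index as an extra first column; pass index=False to omit it:

In [ ]:
#same export without the index column
tables_career[0].to_csv("jobs.csv", index=False)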