This notebook provides an introduction to web (data) scraping in Python using the requests, BeautifulSoup, and pandas libraries.
BeautifulSoup provides many convenient functions for scraping data in Python, which work only on BeautifulSoup objects. Therefore, the page must first be converted into a BeautifulSoup object with the BeautifulSoup() function, after which the functions below can be applied.
- find_all() - finds all tags matching the given criteria. A class or ID can be specified as well to make the search more precise. Returns a list of BeautifulSoup elements.
- find() - works like find_all(), but returns only the very first element matching the criteria. Especially useful when one is sure that only one element satisfies the criteria. Returns a BeautifulSoup element directly.
- get_text() - returns only the text content of a BeautifulSoup element (i.e. without tags).
- get() - returns the value of the specified attribute, e.g. get("href") returns the hyperlink from an <a> tag.
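A minimal, self-contained illustration of these four functions (the HTML snippet below is made up purely for demonstration):
In [ ]:
from bs4 import BeautifulSoup

#a made-up HTML snippet to demonstrate the four functions above
html = '<div><a class="tag" href="/tag/books/">books</a><a class="tag" href="/tag/python/">python</a></div>'
soup = BeautifulSoup(html, "html.parser") #convert to a BeautifulSoup object
print(soup.find_all("a", class_="tag")) #list of both <a> tags
print(soup.find("a")) #the first <a> tag only
print(soup.find("a").get_text()) #'books' - text without the tag
print(soup.find("a").get("href")) #'/tag/books/' - value of the href attribute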
This notebook covers the following cases:
- scraping hashtags, links and author names from quotes.toscrape.com with requests and BeautifulSoup,
- scraping book prices and titles from books.toscrape.com,
- reading HTML tables directly into pandas dataframes with pd.read_html().
In [2]:
import numpy as np #for numeric operations
import pandas as pd #for dealing with dataframes
import matplotlib.pyplot as plt #for visualization
import requests #get html
from bs4 import BeautifulSoup #for scraping
from pprint import pprint #for pretty printing
In [3]:
#robots.txt checked
url = "http://quotes.toscrape.com/"
In [6]:
#get html
response = requests.get(url)
response.status_code #200 means OK, 4xx/5xx mean something went wrong
Out[6]:
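If the site is slow or unreachable, the plain get() call above can hang or silently return an error page. A slightly more defensive variant (a sketch; the 10-second timeout is an arbitrary choice, not required by this site):
In [ ]:
#defensive version of the request (sketch: the timeout value is arbitrary)
response = requests.get(url, timeout=10)
response.raise_for_status() #raises an exception for 4xx/5xx status codes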
In [22]:
page = response.content #html to scrape
page = BeautifulSoup(page,"html.parser") #change type
In [39]:
#find all hashtags, extract text from the first
hashtags_html = page.find_all("a",class_ = "tag") #tag finder
hashtags_html[0].get_text() #get text from the whole tag
Out[39]:
In [43]:
#extract text from all hashtags and save it
#Approach 1: for loop
hashtags = []
for i in hashtags_html:
    hashtags.append(i.get_text())
print(hashtags)
In [45]:
#Approach 2: get texts using a list comprehension
hashtags = [i.get_text() for i in hashtags_html]
print(hashtags)
In [52]:
#from now on, use list comprehensions for simplicity
#get links
links = [i.get("href") for i in hashtags_html]
pprint(links)
In [63]:
#get hashtags from the links, treating each link as a string
hashtags_from_links = [i.split("/")[2] for i in links]
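To see why index 2 is the right one, split a single link: a link like "/tag/love/" splits into ['', 'tag', 'love', ''], so the tag name sits at position 2.
In [ ]:
#inspect one split link and the extracted hashtags
print(links[0].split("/"))
print(hashtags_from_links)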
In [66]:
authors_html = page.find_all("small",class_ = "author")
authors = [i.get_text() for i in authors_html]
print(authors)
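As an optional step, the scraped pieces can be combined into a single dataframe. This is a sketch that assumes the quote texts sit in <span class="text"> tags, which the page source of quotes.toscrape.com suggests:
In [ ]:
#combine quote texts and authors into one dataframe
#(assumption: quote texts are inside <span class="text"> tags)
quotes = [i.get_text() for i in page.find_all("span", class_="text")]
quotes_df = pd.DataFrame({"quote": quotes, "author": authors})
quotes_df.head()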
In [ ]:
#get the HTML page and make it BeautifulSoup type
url = "http://books.toscrape.com/"
response = requests.get(url)
page = response.content
page = BeautifulSoup(page,"html.parser")
In [ ]:
#find tags with prices, get text only, convert to numeric format
prices_html = page.find_all("p",class_="price_color")
prices = [i.get_text() for i in prices_html]
prices_numeric = [float(i.replace("£","")) for i in prices]
print(prices_numeric)
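Since the prices are now numeric, a quick sanity check with numpy (imported above) helps catch parsing mistakes:
In [ ]:
#quick sanity check on the parsed prices
prices_arr = np.array(prices_numeric)
print(prices_arr.min(), prices_arr.mean(), prices_arr.max())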
In [ ]:
#convert prices back to text and save in a txt file (each on new line)
with open("prices.txt","w") as f:
    for i in prices_numeric:
        f.write(str(i)+"\n")
Now we will try to get the book titles. The difficulty of the task is that the title link has no class or id identifier. We will try three approaches: the first will not work, the other two will.
In [ ]:
#trial 1 - parent search (does not work)
#get_text() on the whole <article> returns all nested text
#(title together with price and availability), not the title alone
article_tag = page.find_all("article",class_="product_pod")
titles_wrong = [i.get_text() for i in article_tag]
In [ ]:
#trial 2 - via the parent h3 tags
#find all h3 tags, then the first (and only) <a> tag inside each
h3s = page.find_all("h3")
titles = [i.find("a").get_text() for i in h3s]
In [ ]:
#trial 3 - content of <a>
#find all a tags, then choose only those that
#have the keyword title inside and get their text
a_tags = page.find_all("a")
titles = []
for i in a_tags:
    if str(i).find("title=") > -1:
        text = i.get_text()
        titles.append(text)
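An alternative worth noting: on books.toscrape.com the visible link text may be truncated with "...", while the title attribute of the same <a> tag holds the full book title, so the get() function from the intro can read it directly:
In [ ]:
#read the full titles from the title attribute instead of the link text
titles_full = [h3.find("a").get("title") for h3 in page.find_all("h3")]
print(titles_full[:5])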
In [ ]:
url = "https://www.cbinsights.com/research-unicorn-companies"
#as the data is provided inside an HTML table, we can use pandas to read it
data = pd.read_html(url)
#result is a list with one element, which turns out to be a dataframe
type(data[0])
In [ ]:
startups = data[0]
startups.head()
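Before working with a scraped table, it is worth checking its size and column names (the exact columns depend on the site and may change over time):
In [ ]:
#basic inspection of the scraped table
print(startups.shape)
print(startups.columns.tolist())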
In [ ]:
url_career = "https://careercenter.am/ccidxann.php"
tables_career = pd.read_html(url_career)
len(tables_career)
In [ ]:
#it turned out there were 5 tables on this website; we are interested in the first one
tables_career[0].head()
In [ ]:
tables_career[0].to_csv("jobs.csv")
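Note that by default to_csv() also writes the dataframe index as an extra unnamed first column; passing index=False drops it:
In [ ]:
#same export without the dataframe index column
tables_career[0].to_csv("jobs.csv", index=False)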