Python for Webscraping

  • SOC 590: Big Data and Population Processes
  • 17th October 2016

Tutorial 1: Webscraping a table to .csv

Outline

  1. Import modules
  2. Examine the HTML structure of the webpage
  3. Use Beautiful Soup to parse the webpage into a .csv
  4. Analyze the .csv as a pandas DataFrame
  5. Use plotting libraries to visualize data

In [1]:
import os
import csv
import urllib.request
import webbrowser
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from altair import * 
from bs4 import BeautifulSoup
%matplotlib inline
  • make a data folder

In [2]:
if not os.path.exists('../data/'):
    os.makedirs('../data/')
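  • on Python 3.2+, os.makedirs accepts exist_ok=True, which folds the existence check and the call into one line:

# equivalent one-liner: no error if ../data/ already exists
os.makedirs('../data/', exist_ok=True)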
  • open List of most popular websites on Wikipedia
  • view the page source to see the HTML

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_most_popular_websites'
webbrowser.open_new_tab(url)


Out[3]:
True
  • make a soup of the List of most popular websites page and extract the ranking table via its HTML table tags
  • save the extracted rows as wiki_popular_sites.csv

In [4]:
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")

# <th> tag: defines a header cell in a table
# <tr> tag: defines a row in a table
# <td> tag: defines a cell in a table
headers = [header.text for header in soup.find_all('th')]

# collect the <td> cells within each row; find_all (not find_all_next,
# which scans the rest of the document) keeps cells scoped to their row,
# and header-only rows yield an empty list that is filtered out below
rows = []
for row in soup.find_all('tr')[1:]:
    rows.append([val.text for val in row.find_all('td')])

# newline='' avoids blank lines appearing in the csv on some platforms
with open('../data/wiki_popular_sites.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
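  • as an aside, pandas can often parse simple HTML tables directly, skipping the manual soup step; a minimal sketch, assuming the ranking table is the first <table> pandas finds on the page (requires lxml or html5lib):

# hedged alternative: let pandas parse every <table> on the page,
# keeping the first one (assumes it is the ranking table)
tables = pd.read_html(url, header=0)
popular_sites = tables[0]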
  • read in wiki_popular_sites.csv with pandas

In [5]:
wiki_popular_sites = pd.read_csv("../data/wiki_popular_sites.csv")
wiki_popular_sites.head(10)


Out[5]:
Site Domain Alexa top 100 websites (As of September 4, 2016)[3] SimilarWeb top 100 websites (As of September 4, 2016)[4] Type Principal country
0 Google google.com 1 2 Internet services and products U.S.
1 YouTube youtube.com 2 3 Video sharing U.S.
2 Facebook facebook.com 3 1 Social network U.S.
3 Baidu baidu.com 4 16 Search engine China
4 Yahoo! yahoo.com 5 5 Portal and media U.S.
5 Amazon amazon.com 6 10 E-commerce and cloud computing U.S.
6 Wikipedia wikipedia.org 7 7 Encyclopedia U.S.
7 Tencent QQ qq.com 8 41 Portal China
8 Google India google.co.in 9 14 Search engine India
9 Twitter twitter.com 10 8 Social network U.S.

In [6]:
wiki_popular_sites.columns


Out[6]:
Index(['Site', 'Domain',
       'Alexa top 100 websites\n(As of September 4, 2016)[3]',
       'SimilarWeb top 100 websites\n(As of September 4, 2016)[4]', 'Type',
       '\nPrincipal country\n'],
      dtype='object')
  • rename columns to drop the embedded newlines and footnote markers

In [7]:
columns = ["Site", "Domain", "Alexa top 100 websites", "SimilarWeb top 100 websites", "Type", "Principal country"]
wiki_popular_sites.columns = columns
wiki_popular_sites.columns


Out[7]:
Index(['Site', 'Domain', 'Alexa top 100 websites',
       'SimilarWeb top 100 websites', 'Type', 'Principal country'],
      dtype='object')
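  • assigning a full list of names is positional and breaks if the page layout shifts; an equivalent, order-independent alternative is rename with a mapping keyed on the exact header text from Out[6]:

# equivalent rename keyed on the original header text;
# unmapped columns ('Site', 'Domain', 'Type') are left alone
wiki_popular_sites = wiki_popular_sites.rename(columns={
    'Alexa top 100 websites\n(As of September 4, 2016)[3]': 'Alexa top 100 websites',
    'SimilarWeb top 100 websites\n(As of September 4, 2016)[4]': 'SimilarWeb top 100 websites',
    '\nPrincipal country\n': 'Principal country',
})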
  • basic descriptives

In [8]:
wiki_popular_sites.describe()


Out[8]:
Site Domain Alexa top 100 websites SimilarWeb top 100 websites Type Principal country
count 124 124 124 124 124 124
unique 124 124 124 121 60 31
top Odnoklassniki yahoo.com 74 N/A[notes 1] Search engine U.S.
freq 1 1 1 3 26 48
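  • both ranking columns were read in as strings (note the 'N/A[notes 1]' entries), which is why describe() reports count/unique/top/freq rather than numeric summaries; a sketch of coercing them to numbers, not applied below:

# coerce rank strings to numbers; entries like 'N/A[notes 1]' become NaN
alexa_rank = pd.to_numeric(wiki_popular_sites['Alexa top 100 websites'], errors='coerce')
similarweb_rank = pd.to_numeric(wiki_popular_sites['SimilarWeb top 100 websites'], errors='coerce')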

In [9]:
wiki_popular_sites['Type'].unique()


Out[9]:
array(['Internet services and products', 'Video sharing', 'Social network',
       'Search engine', 'Portal and media',
       'E-commerce and cloud computing', 'Encyclopedia', 'Portal',
       'Email, web services and software suite', 'Online shopping',
       'Photo sharing and social media', 'Portal and instant messaging',
       'Professional Social network', 'Web directories',
       'social news networking, entertainment',
       'Online auctions and shopping',
       'URL shortening for links on Twitter', 'Retail',
       'Internet security and anti-trojan software', 'E-commerce',
       'Social media', 'Online advertising network',
       'Streaming TV and movies', 'Software and technology',
       'Blogging and social media', 'Newspaper', 'Payment system',
       'Blogging', 'Image sharing', 'Question and answer site',
       'Social Networking', 'Technology and software',
       'Source code hosting service', 'Search Engine', 'Pornography',
       'Film, TV show, and video game database', 'Instant Messaging',
       'Online Office Suite', 'Internet forum', 'Classified advertising',
       'File hosting service', 'News', 'Content marketing',
       'Social network, photo sharing, blogging', 'E-commerce and portal',
       'Software as a service', 'Booking engine', 'Pop-up advertising',
       'Television', 'Entertainment', 'Search engine and web browser',
       'Livestreams', 'Wikis', 'Disney portal', 'Torrents', 'Ad serving',
       'Web portal', 'Advertising', 'Weather forecasting',
       'Webpage transcoder for slow connections'], dtype=object)
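  • note the near-duplicate labels that differ only in capitalization ('Search engine' vs. 'Search Engine'); lowercasing would merge such case variants before any grouping, a sketch, not applied below:

# merge case variants such as 'Search engine' / 'Search Engine'
type_lower = wiki_popular_sites['Type'].str.lower()
type_lower.nunique()  # fewer unique values than the raw column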

In [10]:
wiki_popular_sites['Principal country'].unique()


Out[10]:
array([' U.S.', ' China', ' India', ' Japan', ' Russia', ' Germany', ' UK',
       ' France', ' Brazil', ' Italy', ' Spain', ' Korea', ' Mexico',
       ' Canada', ' Hong Kong', ' Cyprus', ' Turkey', ' Indonesia',
       ' Poland', ' Taiwan', ' Netherlands', ' Australia', ' Costa Rica',
       ' Vietnam', ' U.K.', ' Thailand', ' Argentina', ' Poland[5]',
       ' Ukraine', ' Colombia', '\xa0Philippines'], dtype=object)
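  • the country labels carry scraping residue: leading spaces, a non-breaking space ('\xa0Philippines'), footnote markers ('Poland[5]'), and the 'UK' / 'U.K.' split; a cleanup sketch, not applied below:

# strip whitespace (including \xa0) and bracketed footnotes like '[5]'
country = (wiki_popular_sites['Principal country']
           .str.replace(r'\[.*?\]', '')
           .str.strip()
           .replace({'U.K.': 'UK'}))  # unify the two UK spellings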
  • use the pandas groupby function to group websites by Type and Principal country
  • .iloc subsets the result to include only the first column (Site) by position

In [11]:
grouped_sites = wiki_popular_sites.groupby(['Type','Principal country']).count()

In [12]:
grouped_sites.iloc[:,0:1]


Out[12]:
Site
Type Principal country
Ad serving U.S. 1
Advertising U.S. 1
Blogging U.S. 2
Blogging and social media U.S. 1
Booking engine Netherlands 1
Classified advertising U.S. 1
Content marketing U.S. 1
Disney portal U.S. 1
E-commerce China 1
Germany 1
India 1
Japan 2
Russia 1
U.K. 1
UK 1
E-commerce and cloud computing U.S. 1
E-commerce and portal China 1
Email, web services and software suite U.S. 1
Encyclopedia U.S. 1
Entertainment Canada 1
File hosting service U.S. 2
Film, TV show, and video game database U.S. 1
Image sharing U.S. 1
Instant Messaging U.S. 1
Internet forum China 1
Internet security and anti-trojan software China 1
Internet services and products U.S. 1
Livestreams U.S. 1
News China 1
U.K. 1
... ... ...
Search engine U.S. 1
UK 1
Ukraine 1
Vietnam 1
Philippines 1
Search engine and web browser Vietnam 1
Social Networking Russia 1
Social media U.S. 2
Social network China 1
Russia 1
U.S. 2
Social network, photo sharing, blogging Taiwan 1
Software and technology Spain 1
U.S. 1
Software as a service U.S. 1
Source code hosting service U.S. 1
Streaming TV and movies U.S. 1
Technology and software U.S. 1
Television China 1
Torrents Costa Rica 1
URL shortening for links on Twitter U.S. 1
Video sharing China 2
Japan 1
U.S. 1
Weather forecasting U.S. 1
Web directories China 1
Web portal Poland 1
Webpage transcoder for slow connections U.S. 1
Wikis U.S. 1
social news networking, entertainment U.S. 1

107 rows × 1 columns

  • create a pivot table counting the number of sites for each Principal country

In [13]:
by_country = wiki_popular_sites.pivot_table('Type', aggfunc='count', columns='Principal country')
by_country.head(10)


Out[13]:
Principal country
 Argentina      1
 Australia      1
 Brazil         3
 Canada         3
 China         23
 Colombia       1
 Costa Rica     2
 Cyprus         1
 France         1
 Germany        2
Name: Type, dtype: int64
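  • the same per-country tally is available in one line with value_counts, which also sorts the counts in descending order:

# equivalent count of sites per country, largest first
wiki_popular_sites['Principal country'].value_counts()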
  • make a histogram of how often each country appears in the Wikipedia data set

In [14]:
fig, ax = plt.subplots(1, figsize=(12, 8))
fig.subplots_adjust(hspace=0.4)
plt.hist(by_country, bins = 50, color='blue')
plt.xticks(np.arange(0, by_country.max()+1, 2))
plt.yticks(np.arange(0, by_country.value_counts().max()+10, 5))
plt.xlabel('Count')
plt.ylabel('Frequency')
plt.title('Histogram of country frequency in \n List of Popular Websites (Wikipedia)')
plt.show()
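  • with only 31 countries, a bar chart of the counts themselves can be easier to read than a histogram of them; a sketch using pandas' built-in matplotlib wrapper:

# alternative view: one bar per country instead of binned counts
by_country.sort_values(ascending=False).plot(kind='bar', figsize=(12, 8))
plt.ylabel('Number of sites')
plt.show()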


  • convert the by_country series to a DataFrame for plotting with Altair

In [15]:
by_country_df = pd.DataFrame(by_country)

In [16]:
Chart(by_country_df).mark_bar().encode(
    x=X('Type:Q',
        bin=Bin(maxbins=50),
    ),
    y='count(*):Q',
)


Review

In this tutorial, we learned how to examine the HTML structure of a webpage and use the Beautiful Soup module to parse a single table on the page into a .csv file. After creating the .csv, we analyzed it with the pandas module. Lastly, we created visualizations with the matplotlib and Altair plotting libraries.


In [17]:
import sys
import matplotlib
import altair
import bs4
print("System and module version information: \n")
print('Python version:', sys.version_info)
print('urllib.request version:', urllib.request.__version__)
print('numpy version:',np.__version__)
print('pandas version:', pd.__version__)
print('matplotlib version:',matplotlib.__version__)
print('altair version:',altair.__version__)
print('Beautiful Soup version:', bs4.__version__)


System and module version information: 

Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
urllib.request version: 3.5
numpy version: 1.11.1
pandas version: 0.18.1
matplotlib version: 1.5.3
altair version: 1.0.0
Beautiful Soup version: 4.5.1