In [1]:
import csv
import os
import urllib
import urllib.request  # `import urllib` alone does not load the request submodule
import webbrowser

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from altair import *  # supplies Chart, X, Bin used in the plotting cell below

%matplotlib inline
In [2]:
# Ensure the output directory exists. exist_ok=True avoids the
# check-then-create race of `if not os.path.exists(...)` and is idempotent,
# so the cell is safe under Restart & Run All.
os.makedirs('../data/', exist_ok=True)
In [3]:
# Open the Wikipedia article in the default browser so the reader can
# inspect the table we are about to scrape (side effect only; the scrape
# below fetches the page itself).
url = 'https://en.wikipedia.org/wiki/List_of_most_popular_websites'
webbrowser.open_new_tab(url)
Out[3]:
wiki_popular_sites.csv
In [4]:
# Scrape the "most popular websites" table from Wikipedia into a CSV file.
# HTML primer:
#   <th> defines a header cell, <tr> a row, <td> a data cell.
soup = BeautifulSoup(urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_most_popular_websites'), "lxml")

# NOTE(review): this collects every <th> on the page, not just the target
# table's header row -- works while the article has a single table; verify
# if the page layout changes.
headers = [header.text for header in soup.findAll('th')]

# Scope the cell search to the row itself with find_all('td'). The original
# find_all_next('td', limit=6) walks the whole document after the row and can
# bleed cells from the following row whenever a row has fewer than 6 cells.
rows = []
for row in soup.find_all('tr')[1:]:
    rows.append([val.text for val in row.find_all('td')])

# newline='' is required by the csv module so rows are not double-spaced on
# Windows; utf-8 keeps non-ASCII site names intact.
with open('../data/wiki_popular_sites.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    # Header-only or empty <tr> rows produce empty lists; skip them.
    writer.writerows(row for row in rows if row)
Reading wiki_popular_sites.csv with pandas
In [5]:
# Load the scraped table back into a DataFrame and preview the first rows.
csv_path = '../data/wiki_popular_sites.csv'
wiki_popular_sites = pd.read_csv(csv_path)
wiki_popular_sites.head(10)
Out[5]:
In [6]:
wiki_popular_sites.columns
Out[6]:
In [7]:
# Replace the scraped header labels with clean, consistent names,
# then echo the columns to confirm the assignment took effect.
clean_labels = [
    "Site",
    "Domain",
    "Alexa top 100 websites",
    "SimilarWeb top 100 websites",
    "Type",
    "Principal country",
]
wiki_popular_sites.columns = clean_labels
wiki_popular_sites.columns
Out[7]:
In [8]:
wiki_popular_sites.describe()
Out[8]:
In [9]:
wiki_popular_sites['Type'].unique()
Out[9]:
In [10]:
wiki_popular_sites['Principal country'].unique()
Out[10]:
Using the pandas `groupby` function to group websites by Type and Principal country
In [11]:
grouped_sites = wiki_popular_sites.groupby(['Type','Principal country']).count()
In [12]:
grouped_sites.iloc[:,0:1]
Out[12]:
In [13]:
# Count sites per country by pivoting on 'Principal country' with the
# 'Type' column as the counted values.
# NOTE(review): pivot_table without an `index` argument -- the result shape
# (Series vs. one-row DataFrame) depends on the pandas version; the later
# .value_counts() call suggests it was a Series when written. Confirm.
by_country = wiki_popular_sites.pivot_table('Type', aggfunc='count', columns='Principal country')
by_country.head(10)
Out[13]:
In [14]:
# Histogram of how many popular sites each country hosts.
fig, ax = plt.subplots(1, figsize=(12, 8))
fig.subplots_adjust(hspace=0.4)
# Draw on the Axes we created: the original created `ax` and then ignored
# it, plotting through the implicit pyplot state machine instead.
ax.hist(by_country, bins=50, color='blue')
ax.set_xticks(np.arange(0, by_country.max() + 1, 2))
ax.set_yticks(np.arange(0, by_country.value_counts().max() + 10, 5))
ax.set_xlabel('Count')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of country frequency in \n List of Popular Websites (Wikipedia)')
plt.show()
In [15]:
by_country_df = pd.DataFrame(by_country)
In [16]:
# Altair (v1-style API) histogram: bin the per-country counts ('Type:Q')
# into up to 50 bins and plot how many countries fall in each bin.
# NOTE(review): Chart/X/Bin come from `from altair import *`; 'count(*)'
# is Altair v1 syntax -- newer Altair versions use 'count()'. Confirm the
# installed Altair version before changing this cell.
Chart(by_country_df).mark_bar().encode(
x=X('Type:Q',
bin=Bin(
maxbins=50.0,
),
),
y='count(*):Q',
)
In this tutorial, we learned how to examine the HTML structure of a webpage and use the Beautiful Soup
module to parse a single table on a webpage into a .csv file. After creating a .csv of the webpage table, we analyzed the .csv using the pandas
module. Lastly, we created visualizations with plotting libraries.
In [17]:
# Record the environment so the analysis is reproducible later.
import sys
import matplotlib
import altair
import bs4
print("System and module version information: \n")
print('Python version:', sys.version_info)
# urllib is part of the standard library and is versioned with Python
# itself; its __version__ attribute was deprecated and later removed, so
# accessing it directly raises AttributeError on modern interpreters.
print('urllib.request version:', getattr(urllib.request, '__version__', 'stdlib (ships with Python)'))
print('numpy version:',np.__version__)
print('pandas version:', pd.__version__)
print('matplotlib version:',matplotlib.__version__)
print('altair version:',altair.__version__)
print('Beautiful Soup version:', bs4.__version__)