In [1]:
from __future__ import print_function
In [2]:
import sys
'Python version: %s.%s' % (sys.version_info.major, sys.version_info.minor)
Out[2]:
In [3]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import graphviz
In [4]:
print('Requests: %s' % requests.__version__)
print('BeautifulSoup: %s'% bs4.__version__)
print('Pandas: %s'% pd.__version__)
print('Graphviz: %s' % graphviz.__version__)
%matplotlib inline
A rich sitemap might contain page descriptions and modification dates along with image and video metadata, but the basic purpose of a sitemap is to provide a list the pages on a domain that are accessible to users and web crawlers. In this post, we'll use Python and a toolkit of libraries to parse, categorize, and visualize an XML sitemap. This will involve:
The scripts in this post are compatible with Python 2 and 3 and the library dependencies are Requests and BeautifulSoup4 for extracting the URLs, Pandas for categorization, and Graphviz for creating the visual sitemap. Once you have Python, these libraries can most likely be installed on any operating system with the following terminal commands:
pip install requests
pip install beautifulsoup4
pip install pandas
The Graphviz library is more difficult to install. On Mac it can be done with the help of homebrew:
brew install graphviz
pip install graphviz
For other operating systems or alternate methods, check out the installation instructions in the documentation.
We'll use the www.sportchek.ca sitemap as an example. It is hosted on their domain and open to the public. Like most large sites, the entire sitemap is split across multiple XML files. These are indexed at the /sitemap.xml page.
We start by opening the url in Python using requests and then instantiate a "soup" object containing the page content.
In [5]:
url = 'https://www.sportchek.ca/sitemap.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))
Next we can pull the XML sitemap links, which live within the <loc>
tags.
In [6]:
sitemap_index.findAll('loc')
Out[6]:
In [7]:
urls = [element.text for element in sitemap_index.findAll('loc')]
urls
Out[7]:
With some investigation of the XML format for each file above, we again see that URLs can be identified by searching for <loc>
tags. These URLs can be extracted the same as the XML links were from the index. We loop over the XML documents, appending all sitemap URLs to a list.
In [8]:
%%time
def extract_links(url):
''' Open an XML sitemap and find content wrapped in <loc> tags. '''
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
links = [element.text for element in soup.findAll('loc')]
return links
sitemap_urls = []
for url in urls:
links = extract_links(url)
sitemap_urls += links
In [9]:
'Found {:,} URLs in the sitemap'.format(len(sitemap_urls))
Out[9]:
Let's write these to a file that can be opened in Excel.
In [10]:
with open('sitemap_urls.dat', 'w') as f:
for url in sitemap_urls:
f.write(url + '\n')
Let's start by loading in the URLs we wrote to a file.
In [11]:
sitemap_urls = open('sitemap_urls.dat', 'r').read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))
Site-specific categorization such as identifying display listing pages and product pages can be done by applying filters over the URL list. Python is great for this because filters can be very detailed and chained together, plus your results can be reproduced by simply running the script!
On the other hand, we could take a different approach and - instead of filtering for specific URLs - apply an automated algorithm to peel back our sites layers and find the general structure.
In [12]:
def peel_layers(urls, layers=3):
''' Builds a dataframe containing all unique page identifiers up
to a specified depth and counting the number of sub-pages for each.
Prints results to a CSV file.
urls : list
List of page URLs.
layers : int
Depth of automated URL search. Large values for this parameter
may cause long runtimes depending on the number of URLs.
'''
# Store results in a dataframe
sitemap_layers = pd.DataFrame()
# Get base levels
bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
sitemap_layers[0] = bases
# Get specified number of layers
for layer in range(1, layers+1):
page_layer = []
for url, base in zip(urls, bases):
try:
page_layer.append(url.split(base)[-1].split('/')[layer])
except:
# There is nothing that deep!
page_layer.append('')
sitemap_layers[layer] = page_layer
# Count and drop duplicate rows + sort
sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
.rename('counts').reset_index()\
.sort_values('counts', ascending=False)\
.sort_values(list(range(0, layers)), ascending=True)\
.reset_index(drop=True)
# Convert column names to string types and export
sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
sitemap_layers.to_csv('sitemap_layers.csv', index=False)
# Return the dataframe
return sitemap_layers
The peel_layers
function also counts the number of pages for each layer. These can be accessed by looking at the output dataframe in Python or opening the output file sitemap_layers.csv
in Excel. Let's do this for three layers.
In [13]:
sitemap_layers = peel_layers(urls=sitemap_urls, layers=3)
At this point you may be inclined to continue with further analysis in Excel, but we'll invite you to carry on in Python.
The peel_layers
function returns a Pandas DataFrame that we stored in the variable sitemap_layers
. This contains the exported .csv
data as a table inside Python, and it can be filtered or otherwise modified in any way. Say, for example, we are interested in the number of pages relating to hockey. We may want to run a script like this one that searches for rows with "hockey" in the third layer:
In [14]:
counts = 0
for row in sitemap_layers.values:
# Check if the word "hockey" is contained in the 3rd layer
if 'hockey' in row[3]:
# Add the page counts value from the outer right column
counts += row[-1]
print('%d total hockey pages' % counts)
This could also be accomplished in a single line.
In [15]:
counts = sitemap_layers[sitemap_layers['3'].apply(
lambda string: 'hockey' in string)]\
['counts'].sum()
print('%d total hockey pages' % counts)
What we do here is filter the dataframe (as seen below) and then sum the counts column.
In [16]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: 'hockey' in string)]
sitemap_fltr
Out[16]:
This table can be saved to an Excel readable format using the to_csv
function.
In [17]:
sitemap_fltr.to_csv('hockey_pages.csv', index=False)
Filtering conditions can be as specific as you desire. For example if you want to find snowboard and ski pages:
In [18]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: 'ski' in string or\
'snowboard' in string)]
sitemap_fltr
Out[18]:
Oops, it looks like "skills-development" is included as is contains "ski". Let's exclude this term.
In [19]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: ('ski' in string or\
'snowboard' in string)\
and 'skills-dev' not in string)]
sitemap_fltr
Out[19]:
Other useful filtering tools are the split
and len
functions. For instance, we could find all the pages with at least four "-" characters in the 3rd layer.
In [20]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: len(string.split('-')) >= 4)]
sitemap_fltr
Out[20]:
In this example, we split the string into a list of substrings as separated by the dashes and check if the list has more than 3 elements.
Working with Pandas DataFrames in Python can seem very complicated - especially for those new to Python - but the rewards are great.
Storing data in tables is the only reasonable option, but it's not always the best way to view the data. This is especially true when sharing it with others.
The sitemap dataframe we've generated can be nicely visualized using graphviz, where paths are illustrated with nodes and edges. The nodes contain site page layers and the edges are labelled by the number of sub-pages existing within that path.
In [21]:
def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
''' Make a sitemap graph up to a specified layer depth.
sitemap_layers : DataFrame
The dataframe created by the peel_layers function
containing sitemap information.
layers : int
Maximum depth to plot.
limit : int
The maximum number node edge connections. Good to set this
low for visualizing deep into site maps.
'''
# Check to make sure we are not trying to plot too many layers
if layers > len(df) - 1:
layers = len(df)-1
print('There are only %d layers available to plot, setting layers=%d'
% (layers, layers))
# Initialize graph
f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
f.body.extend(['rankdir=LR', 'size="%s"' % size])
def add_branch(f, names, vals, limit, connect_to=''):
''' Adds a set of nodes and edges to nodes on the previous layer. '''
# Get the currently existing node names
node_names = [item.split('"')[1] for item in f.body if 'label' in item]
# Only add a new branch it it will connect to a previously created node
if connect_to:
if connect_to in node_names:
for name, val in list(zip(names, vals))[:limit]:
f.node(name='%s-%s' % (connect_to, name), label=name)
f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))
f.attr('node', shape='rectangle') # Plot nodes as rectangles
# Add the first layer of nodes
for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
.sort_values(['counts'], ascending=False).values:
f.node(name=name, label='{} ({:,})'.format(name, counts))
if layers == 0:
return f
f.attr('node', shape='oval') # Plot nodes as ovals
f.graph_attr.update()
# Loop over each layer adding nodes and edges to prior nodes
for i in range(1, layers+1):
cols = [str(i_) for i_ in range(i)]
for k in df[cols].drop_duplicates().values:
# Compute the mask to select correct data
mask = True
for j, ki in enumerate(k):
mask &= df[str(j)] == ki
# Select the data then count branch size, sort, and truncate
data = df[mask].groupby([str(i)])['counts'].sum()\
.reset_index().sort_values(['counts'], ascending=False)
# Add to the graph
add_branch(f,
names=data[str(i)].values,
vals=data['counts'].values,
limit=limit,
connect_to='-'.join(['%s']*i) % tuple(k))
return f
def apply_style(f, style, title=''):
''' Apply the style and add a title if desired. More styling options are
documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
f : graphviz.dot.Digraph
The graph object as created by graphviz.
style : str
Available styles: 'light', 'dark'
title : str
Optional title placed at the bottom of the graph.
'''
dark_style = {
'graph': {
'label': title,
'bgcolor': '#3a3a3a',
'fontname': 'Helvetica',
'fontsize': '18',
'fontcolor': 'white',
},
'nodes': {
'style': 'filled',
'color': 'white',
'fillcolor': 'black',
'fontname': 'Helvetica',
'fontsize': '14',
'fontcolor': 'white',
},
'edges': {
'color': 'white',
'arrowhead': 'open',
'fontname': 'Helvetica',
'fontsize': '12',
'fontcolor': 'white',
}
}
light_style = {
'graph': {
'label': title,
'fontname': 'Helvetica',
'fontsize': '18',
'fontcolor': 'black',
},
'nodes': {
'style': 'filled',
'color': 'black',
'fillcolor': '#dbdddd',
'fontname': 'Helvetica',
'fontsize': '14',
'fontcolor': 'black',
},
'edges': {
'color': 'black',
'arrowhead': 'open',
'fontname': 'Helvetica',
'fontsize': '12',
'fontcolor': 'black',
}
}
if style == 'light':
apply_style = light_style
elif style == 'dark':
apply_style = dark_style
f.graph_attr = apply_style['graph']
f.node_attr = apply_style['nodes']
f.edge_attr = apply_style['edges']
return f
The code that builds and exports this visualization is contained within a function called make_sitemap_graph
that takes in our data and the number of layers deep we wish to see. For example we can do:
In [22]:
f = make_sitemap_graph(sitemap_layers, layers=2)
f = apply_style(f, 'light', title='Sport Check Sitemap')
f.render(cleanup=True)
f
Out[22]:
Or we can use the dark style:
In [23]:
f = make_sitemap_graph(sitemap_layers, layers=2)
f = apply_style(f, 'dark')
f.render(cleanup=True)
f
Out[23]:
Setting layers=3
we see that our graph is already very large! Here we set size=35
to create a higher resolution PDF file where the details are clearly visible.
In [24]:
f = make_sitemap_graph(sitemap_layers, layers=3, size='35')
f = apply_style(f, 'light')
f.render(cleanup=True)
f
Out[24]:
Another useful feature built into the graphing script is the ability to limit branch size. This can let us create deep sitemap visualizations that don't grow out of control. For example, limiting each branch to the top three (in terms of recursive page count):
In [25]:
sitemap_layers = peel_layers(urls=sitemap_urls, layers=5)
f = make_sitemap_graph(sitemap_layers, layers=5, limit=3, size='25')
f = apply_style(f, 'light')
f.render(cleanup=True)
f
Out[25]:
In this post we have shown how Python can be used to extract, categorize and visualize an XML sitemap. The code we looked at has been optimized for your personal use and aggregated into three Python scripts that can be downloaded here:
These can be leveraged to automate URL extraction, categorization and visualization for the sitemap of your choice. For usage instructions, please make sure to check out the online source code repository.
Thanks for reading, we hope you found this tutorial useful. If you run into any problems using our automated XML sitemap retrieval scripts we are here to help! You can reach us on twitter @ayima.
In [ ]: