In [1]:

    
from __future__ import print_function



In [2]:

    
import sys
'Python version: %s.%s' % (sys.version_info.major, sys.version_info.minor)









    Out[2]:





'Python version: 3.5'



In [3]:

    
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import graphviz



In [4]:

    
print('Requests: %s' % requests.__version__)
print('BeautifulSoup: %s'% bs4.__version__)
print('Pandas: %s'% pd.__version__)
print('Graphviz: %s' % graphviz.__version__)
%matplotlib inline









    



Requests: 2.11.1
BeautifulSoup: 4.5.1
Pandas: 0.18.1
Graphviz: 0.5.1

How to visualize an XML sitemap using Python

A rich sitemap might contain page descriptions and modification dates along with image and video metadata, but the basic purpose of a sitemap is to provide a list the pages on a domain that are accessible to users and web crawlers. In this post, we'll use Python and a toolkit of libraries to parse, categorize, and visualize an XML sitemap. This will involve:

extracting the page URLs
categorizing URLs by page type
plotting a sitemap graph tree

The scripts in this post are compatible with Python 2 and 3 and the library dependencies are Requests and BeautifulSoup4 for extracting the URLs, Pandas for categorization, and Graphviz for creating the visual sitemap. Once you have Python, these libraries can most likely be installed on any operating system with the following terminal commands:

pip install requests pip install beautifulsoup4 pip install pandas

The Graphviz library is more difficult to install. On Mac it can be done with the help of homebrew:

brew install graphviz pip install graphviz

For other operating systems or alternate methods, check out the installation instructions in the documentation.

Extracting URLs

We'll use the www.sportchek.ca sitemap as an example. It is hosted on their domain and open to the public. Like most large sites, the entire sitemap is split across multiple XML files. These are indexed at the /sitemap.xml page.

We start by opening the url in Python using requests and then instantiate a "soup" object containing the page content.



In [5]:

    
url = 'https://www.sportchek.ca/sitemap.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))









    



Loaded page with: <Response [200]>
Created <class 'bs4.BeautifulSoup'> object

Next we can pull the XML sitemap links, which live within the <loc> tags.



In [6]:

    
sitemap_index.findAll('loc')









    Out[6]:





[<loc>https://www.sportchek.ca/sitemap01.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap02.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap03.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap04.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap05.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap06.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap07.xml</loc>,
 <loc>https://www.sportchek.ca/sitemap08.xml</loc>]



In [7]:

    
urls = [element.text for element in sitemap_index.findAll('loc')]
urls









    Out[7]:





['https://www.sportchek.ca/sitemap01.xml',
 'https://www.sportchek.ca/sitemap02.xml',
 'https://www.sportchek.ca/sitemap03.xml',
 'https://www.sportchek.ca/sitemap04.xml',
 'https://www.sportchek.ca/sitemap05.xml',
 'https://www.sportchek.ca/sitemap06.xml',
 'https://www.sportchek.ca/sitemap07.xml',
 'https://www.sportchek.ca/sitemap08.xml']

With some investigation of the XML format for each file above, we again see that URLs can be identified by searching for <loc> tags. These URLs can be extracted the same as the XML links were from the index. We loop over the XML documents, appending all sitemap URLs to a list.



In [8]:

    
%%time

def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in <loc> tags. '''
    
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.findAll('loc')]
    
    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links









    



CPU times: user 16.9 s, sys: 336 ms, total: 17.3 s
Wall time: 24.6 s



In [9]:

    
'Found {:,} URLs in the sitemap'.format(len(sitemap_urls))









    Out[9]:





'Found 52,552 URLs in the sitemap'

Let's write these to a file that can be opened in Excel.



In [10]:

    
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

Categorization

Let's start by loading in the URLs we wrote to a file.



In [11]:

    
sitemap_urls = open('sitemap_urls.dat', 'r').read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))









    



Loaded 52,552 URLs

Site-specific categorization such as identifying display listing pages and product pages can be done by applying filters over the URL list. Python is great for this because filters can be very detailed and chained together, plus your results can be reproduced by simply running the script!

On the other hand, we could take a different approach and - instead of filtering for specific URLs - apply an automated algorithm to peel back our sites layers and find the general structure.



In [12]:

    
def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counting the number of sub-pages for each.
    Prints results to a CSV file.

    urls : list
        List of page URLs.

    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)
    
    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers

The peel_layers function also counts the number of pages for each layer. These can be accessed by looking at the output dataframe in Python or opening the output file sitemap_layers.csv in Excel. Let's do this for three layers.



In [13]:

    
sitemap_layers = peel_layers(urls=sitemap_urls, layers=3)

At this point you may be inclined to continue with further analysis in Excel, but we'll invite you to carry on in Python.

Filtering

The peel_layers function returns a Pandas DataFrame that we stored in the variable sitemap_layers. This contains the exported .csv data as a table inside Python, and it can be filtered or otherwise modified in any way. Say, for example, we are interested in the number of pages relating to hockey. We may want to run a script like this one that searches for rows with "hockey" in the third layer:



In [14]:

    
counts = 0
for row in sitemap_layers.values:

    # Check if the word "hockey" is contained in the 3rd layer
    if 'hockey' in row[3]:
        # Add the page counts value from the outer right column
        counts += row[-1]

print('%d total hockey pages' % counts)









    



3014 total hockey pages

This could also be accomplished in a single line.



In [15]:

    
counts = sitemap_layers[sitemap_layers['3'].apply(
            lambda string: 'hockey' in string)]\
            ['counts'].sum()
    
print('%d total hockey pages' % counts)









    



3014 total hockey pages

What we do here is filter the dataframe (as seen below) and then sum the counts column.



In [16]:

    
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: 'hockey' in string)]
sitemap_fltr









    Out[16]:






  
    
      
      0
      1
      2
      3
      counts
    
  
  
    
      40
      www.sportchek.ca
      categories
      equipment
      hockey
      1276
    
    
      59
      www.sportchek.ca
      categories
      fan-shop
      international-hockey
      160
    
    
      66
      www.sportchek.ca
      categories
      shop-by-sport
      hockey
      1578

This table can be saved to an Excel readable format using the to_csv function.



In [17]:

    
sitemap_fltr.to_csv('hockey_pages.csv', index=False)

Filtering conditions can be as specific as you desire. For example if you want to find snowboard and ski pages:



In [18]:

    
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: 'ski' in string or\
                                                                       'snowboard' in string)]
sitemap_fltr









    Out[18]:






  
    
      
      0
      1
      2
      3
      counts
    
  
  
    
      12
      www.sportchek.ca
      categories
      deals-and-features
      prior-snowboard-clearance
      230
    
    
      30
      www.sportchek.ca
      categories
      deals-and-features
      junior-ski-package
      2
    
    
      31
      www.sportchek.ca
      categories
      deals-and-features
      junior-snowboard-package
      1
    
    
      39
      www.sportchek.ca
      categories
      electronics
      skills-development
      10
    
    
      41
      www.sportchek.ca
      categories
      equipment
      alpine-skiing
      700
    
    
      42
      www.sportchek.ca
      categories
      equipment
      snowboarding
      621

Oops, it looks like "skills-development" is included as is contains "ski". Let's exclude this term.



In [19]:

    
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: ('ski' in string or\
                                                                        'snowboard' in string)\
                                                                        and 'skills-dev' not in string)]
sitemap_fltr









    Out[19]:






  
    
      
      0
      1
      2
      3
      counts
    
  
  
    
      12
      www.sportchek.ca
      categories
      deals-and-features
      prior-snowboard-clearance
      230
    
    
      30
      www.sportchek.ca
      categories
      deals-and-features
      junior-ski-package
      2
    
    
      31
      www.sportchek.ca
      categories
      deals-and-features
      junior-snowboard-package
      1
    
    
      41
      www.sportchek.ca
      categories
      equipment
      alpine-skiing
      700
    
    
      42
      www.sportchek.ca
      categories
      equipment
      snowboarding
      621

Other useful filtering tools are the split and len functions. For instance, we could find all the pages with at least four "-" characters in the 3rd layer.



In [20]:

    
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: len(string.split('-')) >= 4)]
sitemap_fltr









    Out[20]:






  
    
      
      0
      1
      2
      3
      counts
    
  
  
    
      15
      www.sportchek.ca
      categories
      deals-and-features
      rule-the-winter-collections
      142
    
    
      32
      www.sportchek.ca
      categories
      electronics
      trackers-watches-heart-rate
      200

In this example, we split the string into a list of substrings as separated by the dashes and check if the list has more than 3 elements.

Working with Pandas DataFrames in Python can seem very complicated - especially for those new to Python - but the rewards are great.

Visualizing sitemap

Storing data in tables is the only reasonable option, but it's not always the best way to view the data. This is especially true when sharing it with others.

The sitemap dataframe we've generated can be nicely visualized using graphviz, where paths are illustrated with nodes and edges. The nodes contain site page layers and the edges are labelled by the number of sub-pages existing within that path.



In [21]:

    
def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    
    sitemap_layers : DataFrame
        The dataframe created by the peel_layers function 
        containing sitemap information.
    
    layers : int
        Maximum depth to plot.
        
    limit : int
        The maximum number node edge connections. Good to set this
        low for visualizing deep into site maps.
    '''
    
    
    # Check to make sure we are not trying to plot too many layers
    if layers > len(df) - 1:
        layers = len(df)-1
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))
    
    
    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])
    
    
    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''
        
        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]
        
        # Only add a new branch it it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))
        

    f.attr('node', shape='rectangle') # Plot nodes as rectangles
    
    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))
    
    if layers == 0:
        return f
    
    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()
    
    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        for k in df[cols].drop_duplicates().values:
            
            # Compute the mask to select correct data
            mask = True
            for j, ki in enumerate(k):
                mask &= df[str(j)] == ki
                
            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)
                
            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))
            
    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    
    style : str
        Available styles: 'light', 'dark'
        
    title : str
        Optional title placed at the bottom of the graph.
    '''
    
    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }
    
    if style == 'light':
        apply_style = light_style
            
    elif style == 'dark':
        apply_style = dark_style
        
    f.graph_attr = apply_style['graph']
    f.node_attr = apply_style['nodes']
    f.edge_attr = apply_style['edges']
    
    return f

The code that builds and exports this visualization is contained within a function called make_sitemap_graph that takes in our data and the number of layers deep we wish to see. For example we can do:



In [22]:

    
f = make_sitemap_graph(sitemap_layers, layers=2)
f = apply_style(f, 'light', title='Sport Check Sitemap')
f.render(cleanup=True)
f









    Out[22]:

Or we can use the dark style:



In [23]:

    
f = make_sitemap_graph(sitemap_layers, layers=2)
f = apply_style(f, 'dark')
f.render(cleanup=True)
f









    Out[23]:

Setting layers=3 we see that our graph is already very large! Here we set size=35 to create a higher resolution PDF file where the details are clearly visible.



In [24]:

    
f = make_sitemap_graph(sitemap_layers, layers=3, size='35')
f = apply_style(f, 'light')
f.render(cleanup=True)
f









    Out[24]:

Another useful feature built into the graphing script is the ability to limit branch size. This can let us create deep sitemap visualizations that don't grow out of control. For example, limiting each branch to the top three (in terms of recursive page count):



In [25]:

    
sitemap_layers = peel_layers(urls=sitemap_urls, layers=5)
f = make_sitemap_graph(sitemap_layers, layers=5, limit=3, size='25')
f = apply_style(f, 'light')
f.render(cleanup=True)
f









    Out[25]:

Summary and Script Usage

In this post we have shown how Python can be used to extract, categorize and visualize an XML sitemap. The code we looked at has been optimized for your personal use and aggregated into three Python scripts that can be downloaded here:

XML sitemap extraction tool

These can be leveraged to automate URL extraction, categorization and visualization for the sitemap of your choice. For usage instructions, please make sure to check out the online source code repository.

Thanks for reading, we hope you found this tutorial useful. If you run into any problems using our automated XML sitemap retrieval scripts we are here to help! You can reach us on twitter @ayima.



In [ ]:

	0	1	2	3	counts
40	www.sportchek.ca	categories	equipment	hockey	1276
59	www.sportchek.ca	categories	fan-shop	international-hockey	160
66	www.sportchek.ca	categories	shop-by-sport	hockey	1578

	0	1	2	3	counts
12	www.sportchek.ca	categories	deals-and-features	prior-snowboard-clearance	230
30	www.sportchek.ca	categories	deals-and-features	junior-ski-package	2
31	www.sportchek.ca	categories	deals-and-features	junior-snowboard-package	1
39	www.sportchek.ca	categories	electronics	skills-development	10
41	www.sportchek.ca	categories	equipment	alpine-skiing	700
42	www.sportchek.ca	categories	equipment	snowboarding	621