In [1]:
from __future__ import print_function

In [2]:
import sys
'Python version: %s.%s' % (sys.version_info.major, sys.version_info.minor)

'Python version: 3.5'

In [3]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import graphviz

In [4]:
print('Requests: %s' % requests.__version__)
print('BeautifulSoup: %s'% bs4.__version__)
print('Pandas: %s'% pd.__version__)
print('Graphviz: %s' % graphviz.__version__)
%matplotlib inline

Requests: 2.11.1
BeautifulSoup: 4.5.1
Pandas: 0.18.1
Graphviz: 0.5.1

How to visualize an XML sitemap using Python

A rich sitemap might contain page descriptions and modification dates along with image and video metadata, but the basic purpose of a sitemap is to provide a list the pages on a domain that are accessible to users and web crawlers. In this post, we'll use Python and a toolkit of libraries to parse, categorize, and visualize an XML sitemap. This will involve:

  • extracting the page URLs
  • categorizing URLs by page type
  • plotting a sitemap graph tree

The scripts in this post are compatible with Python 2 and 3 and the library dependencies are Requests and BeautifulSoup4 for extracting the URLs, Pandas for categorization, and Graphviz for creating the visual sitemap. Once you have Python, these libraries can most likely be installed on any operating system with the following terminal commands:

pip install requests pip install beautifulsoup4 pip install pandas

The Graphviz library is more difficult to install. On Mac it can be done with the help of homebrew:

brew install graphviz pip install graphviz

For other operating systems or alternate methods, check out the installation instructions in the documentation.

Extracting URLs

We'll use the sitemap as an example. It is hosted on their domain and open to the public. Like most large sites, the entire sitemap is split across multiple XML files. These are indexed at the /sitemap.xml page.

We start by opening the url in Python using requests and then instantiate a "soup" object containing the page content.

In [5]:
url = ''
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

Loaded page with: <Response [200]>
Created <class 'bs4.BeautifulSoup'> object

Next we can pull the XML sitemap links, which live within the <loc> tags.

In [6]:


In [7]:
urls = [element.text for element in sitemap_index.findAll('loc')]


With some investigation of the XML format for each file above, we again see that URLs can be identified by searching for <loc> tags. These URLs can be extracted the same as the XML links were from the index. We loop over the XML documents, appending all sitemap URLs to a list.

In [8]:

def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in <loc> tags. '''
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.findAll('loc')]
    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

CPU times: user 16.9 s, sys: 336 ms, total: 17.3 s
Wall time: 24.6 s

In [9]:
'Found {:,} URLs in the sitemap'.format(len(sitemap_urls))

'Found 52,552 URLs in the sitemap'

Let's write these to a file that can be opened in Excel.

In [10]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')


Let's start by loading in the URLs we wrote to a file.

In [11]:
sitemap_urls = open('sitemap_urls.dat', 'r').read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

Loaded 52,552 URLs

Site-specific categorization such as identifying display listing pages and product pages can be done by applying filters over the URL list. Python is great for this because filters can be very detailed and chained together, plus your results can be reproduced by simply running the script!

On the other hand, we could take a different approach and - instead of filtering for specific URLs - apply an automated algorithm to peel back our sites layers and find the general structure.

In [12]:
def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counting the number of sub-pages for each.
    Prints results to a CSV file.

    urls : list
        List of page URLs.

    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
                # There is nothing that deep!

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers

The peel_layers function also counts the number of pages for each layer. These can be accessed by looking at the output dataframe in Python or opening the output file sitemap_layers.csv in Excel. Let's do this for three layers.

In [13]:
sitemap_layers = peel_layers(urls=sitemap_urls, layers=3)

At this point you may be inclined to continue with further analysis in Excel, but we'll invite you to carry on in Python.


The peel_layers function returns a Pandas DataFrame that we stored in the variable sitemap_layers. This contains the exported .csv data as a table inside Python, and it can be filtered or otherwise modified in any way. Say, for example, we are interested in the number of pages relating to hockey. We may want to run a script like this one that searches for rows with "hockey" in the third layer:

In [14]:
counts = 0
for row in sitemap_layers.values:

    # Check if the word "hockey" is contained in the 3rd layer
    if 'hockey' in row[3]:
        # Add the page counts value from the outer right column
        counts += row[-1]

print('%d total hockey pages' % counts)

3014 total hockey pages

This could also be accomplished in a single line.

In [15]:
counts = sitemap_layers[sitemap_layers['3'].apply(
            lambda string: 'hockey' in string)]\
print('%d total hockey pages' % counts)

3014 total hockey pages

What we do here is filter the dataframe (as seen below) and then sum the counts column.

In [16]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: 'hockey' in string)]

0 1 2 3 counts
40 categories equipment hockey 1276
59 categories fan-shop international-hockey 160
66 categories shop-by-sport hockey 1578

This table can be saved to an Excel readable format using the to_csv function.

In [17]:
sitemap_fltr.to_csv('hockey_pages.csv', index=False)

Filtering conditions can be as specific as you desire. For example if you want to find snowboard and ski pages:

In [18]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: 'ski' in string or\
                                                                       'snowboard' in string)]

0 1 2 3 counts
12 categories deals-and-features prior-snowboard-clearance 230
30 categories deals-and-features junior-ski-package 2
31 categories deals-and-features junior-snowboard-package 1
39 categories electronics skills-development 10
41 categories equipment alpine-skiing 700
42 categories equipment snowboarding 621

Oops, it looks like "skills-development" is included as is contains "ski". Let's exclude this term.

In [19]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: ('ski' in string or\
                                                                        'snowboard' in string)\
                                                                        and 'skills-dev' not in string)]

0 1 2 3 counts
12 categories deals-and-features prior-snowboard-clearance 230
30 categories deals-and-features junior-ski-package 2
31 categories deals-and-features junior-snowboard-package 1
41 categories equipment alpine-skiing 700
42 categories equipment snowboarding 621

Other useful filtering tools are the split and len functions. For instance, we could find all the pages with at least four "-" characters in the 3rd layer.

In [20]:
sitemap_fltr = sitemap_layers[sitemap_layers['3'].apply(lambda string: len(string.split('-')) >= 4)]

0 1 2 3 counts
15 categories deals-and-features rule-the-winter-collections 142
32 categories electronics trackers-watches-heart-rate 200

In this example, we split the string into a list of substrings as separated by the dashes and check if the list has more than 3 elements.

Working with Pandas DataFrames in Python can seem very complicated - especially for those new to Python - but the rewards are great.

Visualizing sitemap

Storing data in tables is the only reasonable option, but it's not always the best way to view the data. This is especially true when sharing it with others.

The sitemap dataframe we've generated can be nicely visualized using graphviz, where paths are illustrated with nodes and edges. The nodes contain site page layers and the edges are labelled by the number of sub-pages existing within that path.

In [21]:
def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    sitemap_layers : DataFrame
        The dataframe created by the peel_layers function 
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number node edge connections. Good to set this
        low for visualizing deep into site maps.
    # Check to make sure we are not trying to plot too many layers
    if layers > len(df) - 1:
        layers = len(df)-1
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))
    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])
    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''
        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]
        # Only add a new branch it it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))

    f.attr('node', shape='rectangle') # Plot nodes as rectangles
    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))
    if layers == 0:
        return f
    f.attr('node', shape='oval') # Plot nodes as ovals
    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        for k in df[cols].drop_duplicates().values:
            # Compute the mask to select correct data
            mask = True
            for j, ki in enumerate(k):
                mask &= df[str(j)] == ki
            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)
            # Add to the graph
                       connect_to='-'.join(['%s']*i) % tuple(k))
    return f

def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here:
    f :
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
    if style == 'light':
        apply_style = light_style
    elif style == 'dark':
        apply_style = dark_style
    f.graph_attr = apply_style['graph']
    f.node_attr = apply_style['nodes']
    f.edge_attr = apply_style['edges']
    return f

The code that builds and exports this visualization is contained within a function called make_sitemap_graph that takes in our data and the number of layers deep we wish to see. For example we can do:

In [22]:
f = make_sitemap_graph(sitemap_layers, layers=2)
f = apply_style(f, 'light', title='Sport Check Sitemap')

sitemap Sport Check Sitemap (52,552) categories> 52,552 deals-and-features> 21,223 shop-by-sport> 8,533 fan-shop> 5,409 men> 5,157 equipment> 4,416 women> 2,762 kids> 2,289 accessories> 1,873 electronics> 569 boxing-day-deals> 320 sneaker-launches> 1

Or we can use the dark style:

In [23]:
f = make_sitemap_graph(sitemap_layers, layers=2)
f = apply_style(f, 'dark')

sitemap (52,552) categories> 52,552 deals-and-features> 21,223 shop-by-sport> 8,533 fan-shop> 5,409 men> 5,157 equipment> 4,416 women> 2,762 kids> 2,289 accessories> 1,873 electronics> 569 boxing-day-deals> 320 sneaker-launches> 1

Setting layers=3 we see that our graph is already very large! Here we set size=35 to create a higher resolution PDF file where the details are clearly visible.

In [24]:
f = make_sitemap_graph(sitemap_layers, layers=3, size='35')
f = apply_style(f, 'light')

sitemap (52,552) categories> 52,552 deals-and-features> 21,223 shop-by-sport> 8,533 fan-shop> 5,409 men> 5,157 equipment> 4,416 women> 2,762 kids> 2,289 accessories> 1,873 electronics> 569 boxing-day-deals> 320 sneaker-launches> 1 search-results> 17,933 wk45-fanwear> 648 holiday-gift-guide> 508 winter-destination> 449 wk45-equipment-deals> 236 prior-snowboard-clearance> 230 wk45-kids-deals> 175 wk45-mens-deals> 147 rule-the-winter-collections> 142 wk45-advertised> 123 wk46-advertised> 105 wk45-womens-deals> 102 rule-the-winter> 78 wk35-new-arrivals> 73 wk46-equipment-deals> 70 wk46-mens-deals> 70 wk46-kids-deals> 46 let-it-rain> 28 skull-candy> 17 everyday-athletics> 15 fitbit-deals> 11 wk46-womens-deals> 8 triggerpoint-equipment> 6 junior-ski-package> 2 junior-snowboard-package> 1 fitness-aerobics> 2,286 hockey> 1,578 golf> 1,084 cycling> 731 baseball-softball> 621 football> 595 soccer> 554 basketball> 372 racquet-sports> 350 running> 125 hiking-camping> 116 volleyball> 60 lacrosse> 29 rugby> 20 curling> 12 nhl> 2,097 mlb> 784 nba> 747 nfl> 716 cup-soccer> 281 mls> 225 cfl> 187 champions-league-soccer> 181 international-hockey> 160 canadian-olympic-team> 31 apparel> 3,660 footwear> 1,497 hockey> 1,276 alpine-skiing> 700 snowboarding> 621 cycling> 506 snowsports> 321 fitness> 296 swim> 211 inline-skates-skateboards> 197 racquet-sports> 175 team-sports> 103 recreational-skates> 10 apparel> 1,594 footwear> 1,168 apparel> 1,217 footwear> 1,072 bags> 1,033 sunglasses> 650 games> 89 water-bottles-hydration> 71 miscellaneous> 30 trackers-watches-heart-rate> 200 cameras> 150 watches-timing> 105 headphones-speakers> 47 handheld-gps> 27 golf-electronics> 19 health-management> 11 skills-development> 10 fanwear> 316 doorcrashers> 4 sneaker-releases-collection> 1

Another useful feature built into the graphing script is the ability to limit branch size. This can let us create deep sitemap visualizations that don't grow out of control. For example, limiting each branch to the top three (in terms of recursive page count):

In [25]:
sitemap_layers = peel_layers(urls=sitemap_urls, layers=5)
f = make_sitemap_graph(sitemap_layers, layers=5, limit=3, size='25')
f = apply_style(f, 'light')

sitemap (52,552) categories> 52,552 deals-and-features> 21,223 shop-by-sport> 8,533 fan-shop> 5,409 search-results> 17,933 wk45-fanwear> 648 holiday-gift-guide> 508 fitness-aerobics> 2,286 hockey> 1,578 golf> 1,084 nhl> 2,097 mlb> 784 nba> 747 clothing> 8,477 shoes> 4,806 outerwear> 1,506 product> 648 over-150> 78 womens-gifts> 74 mens-gifts> 55 canadian-nhl> 647 toronto-maple-leafs> 332 edmonton-oilers> 172 product> 554 toronto-blue-jays> 220 detroit-tigers> 8 product> 384 toronto-raptors> 284 chicago-bulls> 19 athletic-apparel> 1,103 back-to-the-gym> 372 bags-packs> 345 hockey-sticks> 635 protective-equipment> 434 goalie-equipment> 235 golf-apparel> 232 golf-bags-carts> 187 mens-golf-clubs> 187 product> 56 golf> 5 sports-fanwear> 5 product> 29 tech> 16 clothing> 10 product> 20 clothing> 16 ski-snowboard> 7 product> 8,477 footwear> 2,766 footwear1> 1,186 footwear2> 854 product> 1,506 bc-lions-full-zip-hoodie-332147937.html> 1 toronto-maple-leafs-replay-womens-tee-332157543.html> 1 toronto-maple-leafs-locker-room-adjustable-slouch-cap-332088127.html> 1 arizona-diamondbacks-neo-flex-3930-cap-330836090.html> 1 toronto-blue-jays-price-replica-player-youth-jersey-331956218.html> 1 toronto-blue-jays-preferred-pick-womens-cap-pink-332078224.html> 1 product> 78 jerseys> 70 clothing> 59 product> 8 2016-all-star-cn-circle-script-tee-331910380.html> 1 2016-all-star-cn-hoody-331910339.html> 1 toronto-raptors-kyle-lowry-toddler-away-jersey-red-331898232.html> 1 product> 266 jerseys> 18 product> 19 product> 647 product> 323 toronto-maple-leafs-centennial-collection> 9 product> 172 womens> 709 mens> 363 training> 31 back-to-the-gym-gps-sport-watches> 140 back-to-the-gym-headphones> 73 back-to-the-gym-activity-trackers> 60 casual-backpacks> 118 sport-bags> 107 carry-waist-bags> 70 mens-golf-apparel> 147 womens-golf-apparel> 79 kids-golf-apparel> 6 cart-bags> 147 travel-covers> 24 carts> 16 putters> 40 hybrids> 37 golf-sets> 30 composite-one-piece> 590 goalie-sticks> 41 wood-sticks> 3 protective-accessories> 89 helmets-and-cages> 70 elbow-pads> 58 goalie-blockers-and-catchers> 81 goalie-mix-and-match> 47 goalie-pads> 42

Summary and Script Usage

In this post we have shown how Python can be used to extract, categorize and visualize an XML sitemap. The code we looked at has been optimized for your personal use and aggregated into three Python scripts that can be downloaded here:

XML sitemap extraction tool

These can be leveraged to automate URL extraction, categorization and visualization for the sitemap of your choice. For usage instructions, please make sure to check out the online source code repository.

Thanks for reading, we hope you found this tutorial useful. If you run into any problems using our automated XML sitemap retrieval scripts we are here to help! You can reach us on twitter @ayima.

In [ ]: