CSS selector: th
This contains links to both characters and actors.
List of all Marvel Comics characters
CSS selector: .hatnote
This is a multi-page list of lists, where each article URL begins with https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_ and ends with a single letter A through Z, plus a final page for characters whose names begin with a digit. Therefore, we must construct a list of URLs, one per individual list, and do a multi-page retrieval. You can use the following code:
sections = [ letter for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
sections.append("0_9")
urls = [ "https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_" + section for section in sections ]
This is an extremely dense graph, due to the number of links between any two articles. An undirected graph (which sums links back and forth between two articles into a single weighted edge) with a minimum weight greater than 1 helps cut down on clutter. In addition, a spring_layout creates some interesting groupings of individual articles.
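For instance, once the crawl results are loaded into a weighted networkx graph (call it G; how you store the crawl is an assumption here), the weight filtering and layout might look like this sketch:

import networkx as nx
import matplotlib.pyplot as plt

# Sketch: G is assumed to be the weighted, undirected graph built from the crawl
def draw_filtered(G, minimum_weight=2):
    # Keep only edges whose summed link count meets the threshold
    heavy = [ (u, v) for u, v, d in G.edges(data=True)
              if d.get("weight", 1) >= minimum_weight ]
    subgraph = G.edge_subgraph(heavy)
    positions = nx.spring_layout(subgraph)  # force-directed layout groups related articles
    nx.draw_networkx(subgraph, positions, node_size=50, font_size=6)
    plt.show()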
CSS selector: li
This contains links to all Hip Hop musicians who have a Wikipedia article.
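As a quick illustration of what an li selector captures, here is a PyQuery sketch; the article URL is an assumption about where the list lives:

from pyquery import PyQuery

# Illustration only: collect every article link that appears inside an <li> element
doc = PyQuery(url="https://en.wikipedia.org/wiki/List_of_hip_hop_musicians")
links = sorted({ a.attrib["href"] for a in doc("li a")
                 if a.attrib.get("href", "").startswith("/wiki/") })
print(len(links), "candidate articles")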
List of all BET Hip Hop awards
CSS selector: li
This contains links to all BET Hip Hop Award-winning musicians and the names of the works for which they won.
Crawling and saving the graph as an undirected graph versus as a directed graph can generate extremely different visualizations. When the graph is flattened and then visualized as a directed graph with a minimum weight of 2, a number of artists are pushed to the outside of the graph in a ring shape.
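A minimal sketch of the undirected flattening described above, assuming the crawl produced a weighted networkx DiGraph:

import networkx as nx

def flatten(digraph):
    # Sum the weights of reciprocal directed edges into a single undirected edge
    graph = nx.Graph()
    for u, v, d in digraph.edges(data=True):
        weight = d.get("weight", 1)
        if graph.has_edge(u, v):
            graph[u][v]["weight"] += weight
        else:
            graph.add_edge(u, v, weight=weight)
    return graph

Note that networkx's built-in to_undirected() overwrites rather than sums the data of reciprocal edges, which is why the manual loop is used here.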
CSS selector: .fn
This contains links to NBA All-Stars.
There are 408 All-Stars, so you must set max_articles=408. In addition, articles about basketball players often contain dozens of links to other players that may not be relevant for building a social network. Therefore, setting selector="p" as a crawl setting limits relevant links to only those that appear in paragraph text, as the sketch below illustrates.
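To see how much this narrows the candidate links, compare the two selectors directly; this sketch uses Michael Jordan's article purely as an example:

from pyquery import PyQuery

# Illustration only: how much the "p" selector narrows the candidate links
doc = PyQuery(url="https://en.wikipedia.org/wiki/Michael_Jordan")
all_links = doc("a[href^='/wiki/']")
paragraph_links = doc("p a[href^='/wiki/']")
print(len(all_links), "article links total;", len(paragraph_links), "inside paragraphs")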
When creating the graph, setting minimum_weight=2 seems to cluster basketball players together who were active during the same decades.
List of NFL Players by number of games played
CSS selector: .fn
This contains links to the approximately 300 NFL players who have played the most games.
There are 300 players, so you must set max_articles=300. Setting selector="p" as a crawl setting limits relevant links to only those that appear in paragraph text.
The graph is very sparse, with few connections between players. Setting minimum_weight=2 creates an interesting cluster of career quarterbacks at the center of the graph.
The Overwatch Wiki has some special requirements for crawling. Because the Heroes page does not have any convenient CSS selectors for hero links, it is easier to hand-code the URLs for all 24 heroes.
# Note: Lúcio and Torbjörn have special characters in their names; inspect their URLs on the wiki for the percent-encoded forms used below
heroes = [ "Genji", "McCree", "Pharah", "Reaper", "Soldier:_76", "Sombra", "Tracer",
"Bastion", "Hanzo", "Junkrat", "Mei", "Torbj%C3%B6rn", "Widowmaker",
"D.Va", "Orisa", "Reinhardt", "Roadhog", "Winston", "Zarya",
"Ana", "L%C3%BAcio", "Mercy", "Symmetra", "Zenyatta"]
urls = [ "/wiki/" + hero for hero in heroes ]
The Overwatch Wiki is not a Wikipedia wiki, so we have to change a few of the crawling options. Specifically, you must set host="https://overwatch.wikia.com" and title_selector="h1". In addition, you should also set selector="p", because each article contains a link to every other hero.
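If you need absolute URLs at any point (the host option may already handle this for you; treat this as an assumption), the standard library can resolve them:

from urllib.parse import urljoin

host = "https://overwatch.wikia.com"
full_urls = [ urljoin(host, url) for url in urls ]
# e.g. "https://overwatch.wikia.com/wiki/Genji"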
The Overwatch Wiki has very sparse information about each hero, so you may set minimum_weight=1. The groupings will not be strong, nor will the node sizes show a lot of variation. However, you can see individuals whose narratives are tied to each other. For example, members of Talon may appear close to each other.
The Forbes 400 Wikipedia entry is extremely incomplete. To create the crawl list, you can text-mine the Forbes 400 list directly. Each entry is contained inside a strong tag and matches a pattern that begins with a number followed by a period. To convert this data into a usable list of Wikipedia articles, use the following code:
from pyquery import PyQuery
# Use regular expressions to do the heavy lifting
import re

# Make a list of URLs to mine
urls = [ "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/#3381da7c22f4",
         "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/2/#1bf172cb7b17",
         "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/3/#5722e9247c58",
         "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/4/#3262ecd31473"]
strongs = list()
# Get each "strong" HTML tag from each url
for url in urls:
    strongs.extend(PyQuery(url=url)("strong"))
# This regex matches one or more digits followed by a period and a space, then accepts the rest of the string
# For a full explanation of this regex, see https://regex101.com/r/L2ZNig/2
regex = re.compile(r"^\d+\. .+")
# Use another regex to delete the list number, then replace any spaces with an underscore
forbes_400 = [ "/wiki/" + re.sub(r"^\d+\. ", "", strong.text).replace(" ", "_")
               for strong in strongs if strong.text and regex.match(strong.text) ]
print(forbes_400)
You may then perform the crawl with max_articles=400.
Note: Not all entries will be real Wikipedia entries. This code may generate some error messages during the crawl.
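If you would rather skip the bad names up front, one option is to pre-filter the list with HEAD requests; this is a sketch using the requests library, and treating any 200 response after redirects as a real article is an assumption:

import requests

# Optional pre-filter: drop names that don't resolve to a real Wikipedia article
valid = [ path for path in forbes_400
          if requests.head("https://en.wikipedia.org" + path,
                           allow_redirects=True).status_code == 200 ]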
When creating the graph, setting minimum_weight=3 will create clusters of individuals with business or family ties to each other.