CSS selector: th
This contains links to both characters and actors.
List of all Marvel Comics characters
CSS selector: .hatnote
This is a multi-page list of lists, where each article URL begins with https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_ and ends with a single letter A through Z, plus a final page for characters whose names begin with a digit. Therefore, we must construct a list of URLs, one per individual list, and do a multi-page retrieval. You can use the following code:
sections = [ letter for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
sections.append("0_9")
urls = [ "https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_" + section for section in sections ]
This is an extremely dense graph, due to the number of links between any two articles. An undirected graph (which sums links back and forth between two articles into a single weighted edge) with a minimum weight greater than 1 helps cut down on clutter. In addition, a spring_layout creates some interesting groupings of individual articles.
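For instance, once the crawl results are loaded into a weighted networkx graph (call it G; how you store the crawl is an assumption here), the weight filtering and layout might look like this sketch:

import networkx as nx
import matplotlib.pyplot as plt

# Sketch: G is assumed to be the weighted, undirected graph built from the crawl
def draw_filtered(G, minimum_weight=2):
    # Keep only edges whose summed link count meets the threshold
    heavy = [ (u, v) for u, v, d in G.edges(data=True)
              if d.get("weight", 1) >= minimum_weight ]
    subgraph = G.edge_subgraph(heavy)
    positions = nx.spring_layout(subgraph)  # force-directed layout groups related articles
    nx.draw_networkx(subgraph, positions, node_size=50, font_size=6)
    plt.show()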
CSS selector: li
This contains links to all Hip Hop musicians who have a Wikipedia article.
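As a quick illustration of what an li selector captures, here is a PyQuery sketch; the article URL is an assumption about where the list lives:

from pyquery import PyQuery

# Illustration only: collect every article link that appears inside an <li> element
doc = PyQuery(url="https://en.wikipedia.org/wiki/List_of_hip_hop_musicians")
links = sorted({ a.attrib["href"] for a in doc("li a")
                 if a.attrib.get("href", "").startswith("/wiki/") })
print(len(links), "candidate articles")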
List of all BET Hip Hop awards
CSS selector: li
This contains links to all BET Hip Hop Award-winning musicians and the names of the works for which they won.
Crawling and saving the graph as an undirected graph versus as a directed graph can generate extremely different visualizations. When the graph is flattened and then visualized as a directed graph with a minimum weight of 2, a number of artists are pushed to the outside of the graph in a ring shape.
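A minimal sketch of the undirected flattening described above, assuming the crawl produced a weighted networkx DiGraph:

import networkx as nx

def flatten(digraph):
    # Sum the weights of reciprocal directed edges into a single undirected edge
    graph = nx.Graph()
    for u, v, d in digraph.edges(data=True):
        weight = d.get("weight", 1)
        if graph.has_edge(u, v):
            graph[u][v]["weight"] += weight
        else:
            graph.add_edge(u, v, weight=weight)
    return graph

Note that networkx's built-in to_undirected() overwrites rather than sums the data of reciprocal edges, which is why the manual loop is used here.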
CSS selector: .fn
This contains links to NBA All-Stars.
There are 408 All-Stars, so you must set max_articles=408. In addition, articles about basketball players often contain dozens of links to other players that may not be relevant for building a social network. Therefore, setting selector="p" as a crawl setting limits relevant links to only those that appear in paragraph text, as the sketch below illustrates.
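To see how much this narrows the candidate links, compare the two selectors directly; this sketch uses Michael Jordan's article purely as an example:

from pyquery import PyQuery

# Illustration only: how much the "p" selector narrows the candidate links
doc = PyQuery(url="https://en.wikipedia.org/wiki/Michael_Jordan")
all_links = doc("a[href^='/wiki/']")
paragraph_links = doc("p a[href^='/wiki/']")
print(len(all_links), "article links total;", len(paragraph_links), "inside paragraphs")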
When creating the graph, setting minimum_weight=2 seems to cluster basketball players together who were active during the same decades.
List of NFL Players by number of games played
CSS selector: .fn
This contains links to the approximately 300 NFL players who have played the most games.
There are 300 players, so you must set max_articles=300. Setting selector="p" as a crawl setting limits relevant links to only those that appear in paragraph text.
The graph is very sparse, with few connections between players. Setting minimum_weight=2 creates an interesting cluster of career quarterbacks at the center of the graph.
The Overwatch Wiki has some special requirements for crawling. Because the Heroes page does not have any convenient CSS selectors for hero links, it is easier to hand-code the URLs for all 24 heroes.
# Note: Lúcio and Torbjörn have special characters in their names; inspect their URLs on the wiki for the percent-encoded forms used below
heroes = [ "Genji", "McCree", "Pharah", "Reaper", "Soldier:_76", "Sombra", "Tracer",
"Bastion", "Hanzo", "Junkrat", "Mei", "Torbj%C3%B6rn", "Widowmaker",
"D.Va", "Orisa", "Reinhardt", "Roadhog", "Winston", "Zarya",
"Ana", "L%C3%BAcio", "Mercy", "Symmetra", "Zenyatta"]
urls = [ "/wiki/" + hero for hero in heroes ]
The Overwatch Wiki is not a Wikipedia wiki, so we have to change a few of the crawling options. Specifically, you must set host="https://overwatch.wikia.com" and title_selector="h1". In addition, you should also set selector="p", because each article contains a link to every other hero.
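If you need absolute URLs at any point (the host option may already handle this for you; treat this as an assumption), the standard library can resolve them:

from urllib.parse import urljoin

host = "https://overwatch.wikia.com"
full_urls = [ urljoin(host, url) for url in urls ]
# e.g. "https://overwatch.wikia.com/wiki/Genji"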
The Overwatch Wiki has very sparse information about each hero, so you may set minimum_weight=1. The groupings will not be strong, nor will the node sizes show a lot of variation. However, you can see individuals whose narratives are tied to each other. For example, members of Talon may appear close to each other.
The Forbes 400 Wikipedia entry is extremely incomplete. To create the crawl list, you can text-mine the Forbes 400 list directly. Each entry is contained inside a strong tag and matches a pattern that begins with a number followed by a period. To convert this data into a usable list of Wikipedia articles, use the following code:
from pyquery import PyQuery
# Use regular expressions to do the heavy lifting
import re

# Make a list of URLs to mine
urls = [ "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/#3381da7c22f4",
         "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/2/#1bf172cb7b17",
         "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/3/#5722e9247c58",
         "https://www.forbes.com/sites/chasewithorn/2016/10/04/forbes-400-the-full-list-of-the-richest-people-in-america-2016/4/#3262ecd31473"]
strongs = list()
# Get each "strong" HTML tag from each url
for url in urls:
    strongs.extend(PyQuery(url=url)("strong"))
# This regex matches one or more digits followed by a period and a space, then accepts the rest of the string
# For a full explanation of this regex, see https://regex101.com/r/L2ZNig/2
regex = re.compile(r"^\d+\. .+")
# Use another regex to delete the list number, then replace any spaces with an underscore
forbes_400 = [ "/wiki/" + re.sub(r"^\d+\. ", "", strong.text).replace(" ", "_")
               for strong in strongs if strong.text and regex.match(strong.text) ]
print(forbes_400)
You may then perform the crawl with max_articles=400.
Note: Not all entries will be real Wikipedia entries. This code may generate some error messages during the crawl.
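If you would rather skip the bad names up front, one option is to pre-filter the list with HEAD requests; this is a sketch using the requests library, and treating any 200 response after redirects as a real article is an assumption:

import requests

# Optional pre-filter: drop names that don't resolve to a real Wikipedia article
valid = [ path for path in forbes_400
          if requests.head("https://en.wikipedia.org" + path,
                           allow_redirects=True).status_code == 200 ]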
When creating the graph, setting minimum_weight=3 will create clusters of individuals with business or family ties to each other.