We're going to create a social network of characters in the Marvel Cinematic Universe. You are looking at a Jupyter notebook. Each section is a cell that can contain text or Python code. You can run a cell by selecting it and hitting Ctrl-Enter. You will see the results of your code as it runs. Try running the cell below.
In [15]:
import wikinetworking as wn
import networkx as nx
from pyquery import PyQuery
%matplotlib inline
print("OK")
You just ran some Python code that imports packages. Packages are pre-written Python code. The wikinetworking package contains code for crawling, text mining, and graphing Wiki articles. You can access these functions through the wn object.
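If you're curious which functions the package provides, you can ask Python to list the names defined in the module (dir is a built-in that works on any object):
In [ ]:
# List the names available in the wikinetworking module
print(dir(wn))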
Our first step is getting a list of links that we want to crawl. Wikipedia organizes many articles into list pages. (There are many such lists; search Wikipedia for one that fits your topic.) Once you find the URL of a list that contains articles you would like to crawl, paste the URL into the variable below.
In [ ]:
url = "https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_film_actors"
print(url)
Now we can download the article and get a list of links from it.
In [ ]:
links = wn.filter_links(PyQuery(url=url))
print(links)
Many of these links may not be relevant to our topic. We can filter for links that appear inside certain kinds of HTML elements. You can find out which element a relevant link sits inside by inspecting it on the Wikipedia page in your browser. A special kind of filter called a CSS selector lets us keep only the links inside those elements.
In [ ]:
selector="th"
links = wn.filter_links(PyQuery(url=url), selector=selector)
print(links)
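If you're unsure what a selector will match, you can experiment with PyQuery on a small made-up HTML snippet first. The fragment below is purely illustrative and not part of the crawl; the "th a" selector matches only links inside table header cells:
In [ ]:
# A tiny made-up HTML fragment: one link inside a table header, one in a paragraph
snippet = PyQuery('<div><table><tr><th><a href="/wiki/Iron_Man">Iron Man</a></th></tr></table>'
                  '<p><a href="/wiki/New_York_City">New York City</a></p></div>')
# Only the link inside the <th> element matches the "th a" selector
print([a.attrib["href"] for a in snippet("th a")])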
Try this yourself: find another list article, and filter its links with your own CSS selector by filling in the two variables below.
In [ ]:
another_url = ""
another_selector = ""
more_links = wn.filter_links(PyQuery(url=another_url), selector=another_selector)
print(more_links)
What if you need links from a list of lists? You can automatically crawl a list of URLs as well. First, we need to generate that list of URLs.
In [ ]:
url_pattern = "https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_"
sections = [letter for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
sections.append('0-9')
many_urls = [url_pattern + section for section in sections]
print(many_urls)
And then we can crawl this list of URLs.
In [ ]:
selector = ".hatnote"
more_links = wn.retrieve_multipage(many_urls, selector=selector, verbose=True)
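Crawling all of these pages can take a little while. When it finishes, a quick check of how many links were collected (assuming more_links is a plain Python list) tells you whether the crawl worked:
In [ ]:
# Count the links gathered from all of the list pages
print(len(more_links))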
Now that we have a second set of links, we can look for the intersection of the two lists. That should give us only the URLs we want.
In [ ]:
relevant_links = wn.intersection(links, more_links)
print(relevant_links)
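To see why the intersection helps, here is a plain-Python sketch of the idea with two tiny example lists (the exact behavior of wn.intersection may differ slightly):
In [ ]:
# Two small link lists that only partially overlap
film_links = ["/wiki/Iron_Man", "/wiki/Thor", "/wiki/New_York_City"]
comic_links = ["/wiki/Iron_Man", "/wiki/Thor", "/wiki/Hulk"]
# Keeping only links that appear in both lists filters out off-topic pages
print(sorted(set(film_links) & set(comic_links)))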
Let's save these links into a file so we don't have to download the data again.
In [ ]:
wn.write_list(relevant_links, "relevant_links.txt")
Let's also make sure we can load the data after we've saved it.
In [ ]:
relevant_links = wn.read_list("relevant_links.txt")
print(relevant_links)
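Saving and loading a list of links amounts to writing one link per line to a text file and reading the lines back. Here is a rough sketch of that idea in plain Python (not necessarily how wikinetworking implements it), using a separate file name so the real file is left untouched:
In [ ]:
# Write one link per line, then read the lines back into a list
with open("relevant_links_copy.txt", "w") as f:
    f.write("\n".join(relevant_links))
with open("relevant_links_copy.txt") as f:
    print(f.read().splitlines()[:5])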
Now we're ready to crawl. We start from a single article and pass our list of relevant links as the accept argument so the crawler stays on topic.
In [ ]:
starting_url = "/wiki/Iron_Man"
raw_crawl_data = wn.crawl(starting_url, accept=relevant_links)
import json
print(json.dumps(raw_crawl_data, sort_keys=True, indent=4))
We can "flatten" the crawl data into an undirected graph and save it for convenience.
In [ ]:
graph_data = wn.undirected_graph(raw_crawl_data)
import json
print(json.dumps(graph_data, sort_keys=True, indent=4))
wn.save_dict(graph_data, "undirected_graph.json")
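As a quick sanity check before moving on, you can load the data into networkx and count the nodes and edges. This assumes graph_data is a dictionary mapping each article to a list of the articles it links to (a "dict of lists"), a format the networkx Graph constructor can read directly:
In [ ]:
# Build an undirected networkx graph from the adjacency dictionary
G = nx.Graph(graph_data)
print(G.number_of_nodes(), "characters and", G.number_of_edges(), "connections")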
Now we can draw the graph. Go to Part 2 - Drawing the Network...