An API is an Application Programming Interface: a standardized way for programs to communicate and share data with each other. Wikipedia runs on an open source platform called MediaWiki, as do many other wikis. You can use the MediaWiki API to do almost anything you can do in a browser.
You want to use the API (rather than just downloading the full text of the HTML page as if you were a web browser) for a few reasons: it uses fewer resources (for you and Wikipedia), it is standardized, and it is very well supported in many different programming languages.
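As a taste of what the client libraries below wrap for you, here is a minimal sketch of calling the MediaWiki action API directly with the requests library (the endpoint and parameters are the standard action API; only requests is assumed to be installed):
In [ ]:
import requests

# Query the standard MediaWiki "action" API for basic page metadata
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "Berkeley, California",
        "prop": "info",
        "format": "json",
    },
)
resp.json()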
In [ ]:
!pip install wikipedia
import wikipedia
In this example, we will get the page for Berkeley, California and count the most commonly used words in the article. I'm using nltk, which is a nice library for natural language processing (although it is probably overkill for this).
In [ ]:
bky = wikipedia.page("Berkeley, California")
bky
In [ ]:
bk_split = bky.content.split()
In [ ]:
bk_split[:10]
In [ ]:
!pip install nltk
import nltk
In [ ]:
fdist1 = nltk.FreqDist(bk_split)
fdist1.most_common(10)
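The most common tokens will mostly be function words like "the" and "of". One quick way to filter them out is nltk's stopword list; this is a sketch that assumes you download the stopwords corpus first:
In [ ]:
nltk.download('stopwords')
from nltk.corpus import stopwords

# Drop common English function words before counting
stops = set(stopwords.words('english'))
content_words = [w for w in bk_split if w.lower() not in stops]
nltk.FreqDist(content_words).most_common(10)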
A Wikipedia page object has many other methods and attributes. We can also get all the Wikipedia articles that are linked from a page, all of its external reference URLs, or its geographical coordinates.
There was a study about which domains are most popular in Wikipedia references; with the references list we can do a miniature version of that analysis ourselves (see the sketch below).
In [ ]:
print(bky.references[:10])
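In that spirit, here is a small sketch that counts the domains appearing in this article's reference URLs, using only the standard library:
In [ ]:
from collections import Counter
from urllib.parse import urlparse

# Tally the domain (netloc) of every reference URL in the article
domain_counts = Counter(urlparse(url).netloc for url in bky.references)
domain_counts.most_common(10)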
In [ ]:
print(bky.links[:10])
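The page object also exposes coordinates for geotagged articles as a (latitude, longitude) pair; note that on pages without coordinates this may raise an error:
In [ ]:
# (latitude, longitude) of the article's subject, if the page is geotagged
bky.coordinates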
pywikibot is one of the most well-developed and widely used libraries for querying the Wikipedia API. It needs a configuration script (user-config.py) in the directory where you are running the Python script. It is often used by bots that edit, so many features are not available unless you log in with a Wikipedia account.
If you don't have one, register an account on Wikipedia. Then modify the string below so that the usernames line reads u'YourUserName'. You are not entering your password, because you are not logging in with this account; this is just so that there is a place to contact you if your script goes out of control. It is not required to use pywikibot, but it is part of the rules for accessing Wikipedia's API.
In this tutorial, I'm not going to tell you how to set up OAuth so that you can log in and edit. But if you are interested in this, I'd love to talk to you about it.
Note: you can edit pages with pywikibot (even when not logged in), but please don't! You have to get approval from Wikipedia's bot approval group, or else your IP address is likely to be banned.
In [ ]:
user_config="""
family = 'wikipedia'
mylang = 'en'
usernames['wikipedia']['en'] = u'REPLACE THIS WITH YOUR USERNAME'
"""
In [ ]:
with open('user-config.py', 'w') as f:
    f.write(user_config)
In [ ]:
!pip install pywikibot
import pywikibot
In [ ]:
site = pywikibot.Site()
In [ ]:
bky_page = pywikibot.Page(site, "Berkeley, California")
bky_page
In [ ]:
# page text with all the wikimarkup and templates
bky_page_text = bky_page.text
# page text with templates expanded (still wikitext, not HTML)
bky_page.expand_text()
In [ ]:
# All the geographical coordinates linked in a page (may have multiple per article)
bky_page.coordinates()
In [ ]:
from pywikibot import pagegenerators
In [ ]:
cat = pywikibot.Category(site,'Category:Cities in Alameda County, California')
In [ ]:
gen = cat.members()
gen
In [ ]:
# create an empty list
coord_d = []
In [ ]:
for page in gen:
    print(page.title(), page.coordinates())
    # If the page is not a category, record each coordinate it links
    if not page.isCategory():
        for coord in page.coordinates():
            coord_d.append({'label': page.title(), 'latitude': coord.lat, 'longitude': coord.lon})
In [ ]:
coord_d[:3]
In [ ]:
import pandas as pd
coord_df = pd.DataFrame(coord_d)
coord_df
Pages are only members of the category they are directly in. If a page is in a category, and that category is a member of another category, the page will not be returned by the members() function. The basic rule is that if you're on a category's Wikipedia page (like http://enwp.org/Category:Universities_and_colleges_in_California), the members are only the items that appear as blue links on that page. So to reach subcategory members, you have to iterate through the category recursively. This exercise is left to the reader :) -- though a rough sketch follows the note below.
Note: many Wikipedia categories aren't necessarily restricted to the kind of entity mentioned in the category name. So "Category:Universities and colleges in California" contains a subcategory "Category:People by university or college in California" that has people associated with each university. You have to be careful when recursively going through subcategories, or else you might end up with different kinds of entities.
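For the curious, here is one rough sketch of a depth-limited recursive walk (my own scaffolding, not the canonical solution); capping the depth is one simple way to avoid wandering into subcategories full of people or other unrelated entities:
In [ ]:
def category_articles(cat, depth=1):
    """Yield article pages from cat, descending into subcategories up to depth levels."""
    for member in cat.members():
        if member.isCategory():
            if depth > 0:
                # Rebuild the subcategory from its title, as the examples below do
                yield from category_articles(pywikibot.Category(site, member.title()), depth - 1)
        else:
            yield member

[p.title() for p in category_articles(pywikibot.Category(site, 'Category:Universities and colleges in California'))][:10]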
In [ ]:
bay_cat = pywikibot.Category(site,'Category:Universities and colleges in California')
bay_gen = bay_cat.members()
In [ ]:
for page in bay_gen:
    print(page.title(), page.isCategory(), page.coordinates())
Backlinks are all the pages that link to a page. Note: this list can get very, very long for even moderately popular articles.
In [ ]:
telegraph_page = pywikibot.Page(site, u"Telegraph Avenue")
telegraph_backlinks = telegraph_page.backlinks()
telegraph_backlinks
In [ ]:
# Print only the Talk pages (namespace 1) that link to the article
for bl_page in telegraph_page.backlinks():
    if bl_page.namespace() == 1:
        print(bl_page.title())
Who has contributed to a page, and how many times have they edited?
In [ ]:
telegraph_page.contributors()
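contributors() returns a Counter-style mapping of usernames to edit counts, so (assuming that behavior) the heaviest editors are easy to pull out:
In [ ]:
# Top five editors by number of edits
telegraph_page.contributors().most_common(5)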
Templates are the extensions to wikimarkup that give you things like citations, tables, and infoboxes. You can iterate over all the templates in a page.
Wikipedia articles are filled with templates, which are small scripts written in wikimarkup. Everything you see in an article that isn't a basic markup feature (bolding, links, lists, images) is presented through a template. Some of the most important templates are infoboxes, the summary boxes on the right-hand side of articles.
But templates are complicated and very difficult to parse -- which is why Wikidata is such a big deal! It is possible, though, to parse a given kind of template with pywikibot's textlib parser. Infoboxes come in different kinds based on what the article's topic is an instance of: cities, towns, and other similar articles use "Infobox settlement" -- which you can see by looking at the first part of the article's wikitext.
In [ ]:
bky_page = pywikibot.Page(site, "Berkeley, California")
bky_page.text
If you look at the raw wikitext on Wikipedia itself (by clicking the edit button), you can see that it is a little more ordered:
We use the textlib module from pywikibot, which has a function that parses an article's wikitext into a list of templates. Each item in the list is a tuple of the template's name and an OrderedDict mapping the template's parameters to their values.
In [ ]:
from pywikibot import textlib
import pandas as pd
In [ ]:
bky_templates = textlib.extract_templates_and_params_regex(bky_page.text)
bky_templates[:5]
We iterate through all the templates on the page until we find the "Infobox settlement" template.
In [ ]:
for template in bky_templates:
    if template[0] == "Infobox settlement":
        infobox = template[1]

infobox.keys()
In [ ]:
print(infobox['elevation_ft'])
print(infobox['area_total_sq_mi'])
print(infobox['utc_offset_DST'])
print(infobox['population_total'])
However, sometimes parameters contain templates, such as citations or references.
In [ ]:
print(infobox['government_type'])
In [ ]:
print(infobox['website'])
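If you just want the human-readable value of such a parameter, one rough approach (a sketch, not a real wikitext parser; it won't handle deeply nested templates) is to strip template and link markup with regular expressions:
In [ ]:
import re

def strip_wikimarkup(value):
    # Drop {{...}} templates (non-greedy, so nested templates are only partially handled)
    value = re.sub(r"\{\{.*?\}\}", "", value, flags=re.DOTALL)
    # Reduce [[target|label]] and [[target]] links to their display text
    value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", value)
    return value.strip()

print(strip_wikimarkup(infobox['government_type']))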
In [ ]:
bay_cat = pywikibot.Category(site, 'Category:Cities_in_the_San_Francisco_Bay_Area')
bay_gen = bay_cat.members()
for page in bay_gen:
    # If the page is not a category
    if not page.isCategory():
        print(page.title())
        page_templates = textlib.extract_templates_and_params_regex(page.text)
        for template in page_templates:
            if template[0] == "Infobox settlement":
                infobox = template[1]
                if 'elevation_ft' in infobox:
                    print("  Elevation (ft): ", infobox['elevation_ft'])
                if 'population_total' in infobox:
                    print("  Population: ", infobox['population_total'])
                if 'area_total_sq_mi' in infobox:
                    print("  Area (sq mi): ", infobox['area_total_sq_mi'])
This is a script for Katy, getting data about U.S. nuclear power plants. Wikipedia's articles on nuclear power plants are organized into many subcategories.
So we are going to begin with Category:Nuclear power stations in the United States by state and go just one subcategory down. (There is probably a more elegant way of doing this with recursion and functions, along the lines of the sketch earlier....)
In [ ]:
power_cat = pywikibot.Category(site, 'Category:Nuclear power stations in the United States by state')
power_gen = power_cat.members()
for page in power_gen:
    print(page.title())
    # If the page is not a category, parse its infobox directly
    if not page.isCategory():
        print("\n", page.title(), "\n")
        page_templates = textlib.extract_templates_and_params_regex(page.text)
        for template in page_templates:
            if template[0] == "Infobox power station":
                infobox = template[1]
                if 'ps_units_operational' in infobox:
                    print("  Units operational:", infobox['ps_units_operational'])
                if 'owner' in infobox:
                    print("  Owner:", infobox['owner'])
    # Otherwise, go one subcategory down and parse each member page
    else:
        for subpage in pywikibot.Category(site, page.title()).members():
            print("\n", subpage.title())
            subpage_templates = textlib.extract_templates_and_params_regex(subpage.text)
            for template in subpage_templates:
                if template[0] == "Infobox power station":
                    infobox = template[1]
                    if 'ps_units_operational' in infobox:
                        print("  Units operational:", infobox['ps_units_operational'])
                    if 'owner' in infobox:
                        print("  Owner:", infobox['owner'])