By Stuart Geiger and Jamie Whitacre, made at a SciPy 2016 sprint. See the rendered, interactive, embeddable map here.
The GitHub API is powerful: almost anything you can do on GitHub can be done through the API. While this notebook only takes you through the more passive functions that read data from GitHub, there are also many functions that make changes to GitHub. Be careful when trying out a new function!
We are using the PyGithub library, and you will want to log in for much higher rate limits. You can put your credentials directly into a notebook (not recommended!) or put them in a file named "ghlogin.py" and then import it. Note that GitHub now requires a personal access token instead of your account password for API authentication, so use a token if password login fails. Make sure that your ghlogin.py file is ignored by git in your .gitignore file.
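A minimal sketch of what a ghlogin.py file might look like (the variable names match what this notebook imports below; the values are placeholders, not real credentials):

```python
# ghlogin.py -- keep this file out of version control.
# These variable names are the ones this notebook imports;
# the values below are placeholders you should replace with your own.
gh_user = "your-github-username"
gh_passwd = "your-github-password-or-token"
```

Add a line containing `ghlogin.py` to your .gitignore so this file is never committed.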
We are using pygithub, geopy, and ipywidgets in this notebook. We are also using datetime, but that comes with python.
In [1]:
!pip install pygithub
!pip install geopy
!pip install ipywidgets
In [2]:
from github import Github
In [3]:
# My private login credentials are stored in ghlogin.py
import ghlogin
In [4]:
g = Github(login_or_token=ghlogin.gh_user, password=ghlogin.gh_passwd)
With this Github object, you can retrieve all kinds of GitHub objects, which you can then further explore.
A quick lightning tutorial inside this tutorial: there are many ways to explore the properties and methods of various objects. This is very useful when exploring a new method.
One way is to use tab completion, which is supported in Jupyter notebooks. Once you have executed code storing an object to a variable, type the variable name, then a dot, then hit tab to explore. If you don't have this, you can also use an extended version of the dir function. This vdir() function shows the methods and properties of an object, excluding those that begin with underscores (which are ones you will likely not use in this tutorial).
In [43]:
def vdir(obj):
    return [x for x in dir(obj) if not x.startswith('_')]
In [44]:
vdir(g)
Out[44]:
In [45]:
user = g.get_user("staeiou")
In [47]:
vdir(user)
Out[47]:
In [46]:
print(user.name)
print(user.created_at)
print(user.location)
Repositories work similarly to users. You have to give the name of the user or organization that owns the repository, then a slash, then the name of the repository. Some of the repository's properties are easily printed (like name and description), while others are full-fledged GitHub objects in themselves, with many methods and properties of their own (like organization or commit).
In [7]:
repo = g.get_repo("jupyter/notebook")
In [48]:
vdir(repo)
Out[48]:
There are lots of properties or methods of objects that return other objects (like repos, users, organizations), and you can quickly access properties or methods of these objects with a dot.
There are also methods that return paginated lists of objects, like repo.get_commits() or repo.get_contributors(). You need to iterate through these lists or access them by index. The items you get from these lists are themselves objects with their own properties and methods.
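Under the hood, these paginated lists fetch results from the API one page at a time as you iterate. Here is a rough pure-Python sketch of the idea, where fetch_page is a hypothetical stand-in for the HTTP request PyGithub makes, not a real PyGithub function:

```python
def fetch_page(page_num, per_page=3):
    # Hypothetical stand-in for an API request; pretend the server has 7 items.
    data = ["commit-%d" % i for i in range(7)]
    start = page_num * per_page
    return data[start:start + per_page]

def paginated():
    # Yield items lazily, requesting the next page only when needed,
    # and stopping when the server returns an empty page.
    page_num = 0
    while True:
        page = fetch_page(page_num)
        if not page:
            return
        for item in page:
            yield item
        page_num += 1

commits = list(paginated())
print(len(commits))  # 7
```

This is why a PaginatedList can represent thousands of commits without downloading them all up front: each page is only requested when your loop reaches it.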
In [52]:
print(repo.name)
print(repo.description)
print(repo.organization)
print(repo.organization.name)
print(repo.organization.location)
print(repo.language)
print(repo.get_contributors())
print(repo.get_commits())
In [53]:
commits = repo.get_commits()
commit = commits[0]
print("Author name: ", commit.author.name)
print("Committer name: ", commit.committer.name)
print("Lines added: ", commit.stats.additions)
print("Lines deleted: ", commit.stats.deletions)
print("Commit message:\n---------\n", commit.commit.message)
In [54]:
import datetime
In [55]:
one_month_ago = datetime.datetime.now() - datetime.timedelta(days=30)
net_lines_added = 0
num_commits = 0
for commit in repo.get_commits(since=one_month_ago):
    net_lines_added += commit.stats.additions
    net_lines_added -= commit.stats.deletions
    num_commits += 1
print(net_lines_added, num_commits)
In [56]:
issue = repo.get_issues()[0]
vdir(issue)
Out[56]:
In [57]:
issues = repo.get_issues()
for issue in issues:
    last_updated_delta = datetime.datetime.now() - issue.updated_at
    if last_updated_delta > datetime.timedelta(days=365):
        print(issue.title, last_updated_delta.days)
Organizations are objects too, which have similar properties:
In [58]:
org = g.get_organization("jupyter")
In [59]:
print(org.name)
print(org.created_at)
print(org.html_url)
We can go through all the repositories in the organization with the get_repos() function. It returns a list of repository objects, which have their own properties and methods.
In this example, we iterate through all the repositories in the organization, adding an entry to an initially empty dictionary for each one: the key is the repository's name and the value is the number of times the repository has been forked.
In [75]:
repos = {}
for repo in org.get_repos():
    repos[repo.name] = repo.forks_count
repos
Out[75]:
Before we get into how to query GitHub, we know we will have to get location coordinates for each contributor and then plot them on a map. So we are going to do that first.
For geolocation, we are using geopy's geolocator object, which is based on OpenStreetMap's Nominatim API. Nominatim takes in any arbitrary location data and then returns a location object, which includes the best latitude and longitude coordinates it can find.
This does mean that we will have more error than if we did this manually, and there might be vastly different levels of accuracy. For example, if someone just has "UK" as their location, it will show up in the geographic center of the UK, which is somewhere on the edge of the Lake District. "USA" resolves to somewhere in Kansas. However, you can get very specific location data if you put in more detail.
In [24]:
from geopy.geocoders import Nominatim
# Recent versions of geopy require a user_agent string identifying your application
geolocator = Nominatim(user_agent="github-org-mapping-tutorial")
uk_loc = geolocator.geocode("UK")
print(uk_loc.longitude,uk_loc.latitude)
us_loc = geolocator.geocode("USA")
print(us_loc.longitude,us_loc.latitude)
bids_loc = geolocator.geocode("Doe Library, Berkeley CA, 94720 USA")
print(bids_loc.longitude,bids_loc.latitude)
We can plot points on a map using ipyleaflet and ipywidgets. We first set up a Map object, which is created with various parameters. Then we create Marker objects, which are appended to the map. We then display the map inline in this notebook.
In [25]:
import ipywidgets
from ipyleaflet import (
    Map,
    Marker,
    TileLayer, ImageOverlay,
    Polyline, Polygon, Rectangle, Circle, CircleMarker,
    GeoJSON,
    DrawControl
)
center = [30.0, 5.0]
zoom = 2
m = Map(default_tiles=TileLayer(opacity=1.0), center=center, zoom=zoom, layout=ipywidgets.Layout(height="600px"))
uk_mark = Marker(location=[uk_loc.latitude, uk_loc.longitude])
m += uk_mark
us_mark = Marker(location=[us_loc.latitude, us_loc.longitude])
m += us_mark
bids_mark = Marker(location=[bids_loc.latitude, bids_loc.longitude])
m += bids_mark
In [ ]:
m
Now that we have made a few requests, we can see what our rate limit is. If you are logged in, you get 5,000 requests per hour; if you are not, you only get 60 per hour. You can use properties of the Github object to see your remaining queries, hourly limit, and reset time. We have used fewer than 100 of our 5,000 requests with these calls.
In [20]:
g.rate_limiting
Out[20]:
In [21]:
reset_time = g.rate_limiting_resettime
reset_time
Out[21]:
This value is in seconds since the UTC epoch (Jan 1st, 1970), so we have to convert it. Here is a quick function that takes a GitHub object, queries the API to find our next reset time, and converts it to minutes.
In [22]:
import datetime
def minutes_to_reset(github):
    reset_time = github.rate_limiting_resettime
    timedelta_to_reset = datetime.datetime.fromtimestamp(reset_time) - datetime.datetime.now()
    return timedelta_to_reset.total_seconds() / 60
In [23]:
minutes_to_reset(g)
Out[23]:
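To see the epoch conversion on its own, here is the same idea applied to a fixed, hypothetical reset timestamp (the real value comes from the API call above), using only the standard library:

```python
import datetime

# A hypothetical reset timestamp: seconds since the UTC epoch (Jan 1st, 1970).
reset_timestamp = 1468000000

# Convert to a timezone-aware UTC datetime.
reset_dt = datetime.datetime.fromtimestamp(reset_timestamp, tz=datetime.timezone.utc)
print(reset_dt.isoformat())  # 2016-07-08T17:46:40+00:00
```

Subtracting the current time from a datetime like this gives a timedelta, which is how minutes_to_reset above computes the minutes remaining.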
For our mapping script, we want to get profiles for everyone who has made a commit to any of the repositories in the Jupyter organization, find their location (if any), and add it to a list. The API has a get_contributors method for repo objects, which returns a list of contributors ordered by number of commits, but there is no single call that works across all repos in an org. So we have to iterate through all the repos in the org and run the get_contributors method for each. We also want to make sure we don't add any duplicates to our list and over-represent any areas, so we keep track of people in a dictionary.
I've written a few functions to make it easy to retrieve and map an organization's contributors.
In [26]:
def get_org_contributor_locations(github, org_name):
    """
    For a GitHub organization, get location for contributors to any repo in the org.
    Returns a dictionary of {user URL : geopy Location}, then a dictionary of various metadata.
    """
    # Set up an empty dictionary and metadata counters
    contributor_locs = {}
    none_count = 0
    error_count = 0
    user_loc_count = 0
    duplicate_count = 0
    geolocator = Nominatim(user_agent="github-org-mapping-tutorial")
    # For each repo in the organization
    for repo in github.get_organization(org_name).get_repos():
        # For each contributor in the repo
        for contributor in repo.get_contributors():
            print('.', end="")
            # If the contributor_locs dictionary doesn't have an entry for this user
            if contributor_locs.get(contributor.url) is None:
                # Try-except block to handle API errors
                try:
                    # If the contributor has no location in their profile
                    if contributor.location is None:
                        none_count += 1
                    else:
                        # Get coordinates for the location string from the Nominatim API
                        location = geolocator.geocode(contributor.location)
                        # Add a new entry: key is the user's URL, value is the geocoded location object
                        contributor_locs[contributor.url] = location
                        user_loc_count += 1
                except Exception:
                    print('!', end="")
                    error_count += 1
            else:
                duplicate_count += 1
    return contributor_locs, {'no_loc_count': none_count,
                              'user_loc_count': user_loc_count,
                              'duplicate_count': duplicate_count,
                              'error_count': error_count}
With this, we can easily query an organization. The U.S. Digital Service (org name: usds) is a small organization that works well for testing these kinds of queries. It takes about a second per contributor to get this data, so we want to test on small orgs first. To show the status, the function prints a period for each successful query and an exclamation point for each error.
The get_org_contributor_locations function takes a Github object and an organization name, and returns two dictionaries: one of user and location data, and one of metadata about the geolocation query (including the number of users without a location in their profile).
In [27]:
usds_locs, usds_metadata = get_org_contributor_locations(g,'usds')
In [28]:
usds_metadata
Out[28]:
We are going to explore this dataset, but not plot names or usernames. I'm a bit hesitant to publish location data with unique identifiers, even if people put that information in their profiles. This code iterates through the dictionary and puts location data into a list.
In [29]:
usds_locs_nousernames = []
for contributor, location in usds_locs.items():
    usds_locs_nousernames.append(location)
usds_locs_nousernames
Out[29]:
Now we can map this data using another function I have written.
In [81]:
def map_location_dict(map_obj, org_location_dict):
    """
    Maps the locations in a dictionary of {ids : geopy Locations}.
    Must be passed a map object, then the dictionary. Returns the map object.
    """
    for username, location in org_location_dict.items():
        if location is not None:
            mark = Marker(location=[location.latitude, location.longitude])
            map_obj += mark
    return map_obj
In [82]:
center = [30.0,5.0]
zoom = 2
usds_map = Map(default_tiles=TileLayer(opacity=1.0), center=center, zoom=zoom, layout=ipywidgets.Layout(height="600px"))
usds_map = map_location_dict(usds_map, usds_locs)
Now show the map inline! With the leaflet widget, you can zoom in and out directly in the notebook. And we can also export it to an html widget by going to the Widgets menu in Jupyter notebooks, clicking "Embed widgets," and copy/pasting this to an html file. It will not show up in rendered Jupyter notebooks on Github, but may show up in nbviewer.
In [ ]:
usds_map
In [ ]: