Homework 5: Networks and Congress

Due Friday, November 15, 11:59pm



In [1]:
%matplotlib inline

import json

import numpy as np
import networkx as nx
import requests
from pattern import web
import matplotlib.pyplot as plt

# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

The website govtrack.us collects data on activities in the Senate and House of Representatives. It's a great source of information for making data-driven assessments about Congress.

Problem 1.

The directories at http://www.govtrack.us/data/congress/113/votes/2013 contain JSON information about every vote cast for the current (113th) Congress. Subdirectories beginning with "S" correspond to Senate votes, while subdirectories beginning with "H" correspond to House votes.

Write two functions: one that downloads and parses a single Senate vote page given the vote number, and another that repeatedly calls this function to build a full collection of Senate votes from the 113th Congress.


In [2]:
"""
Function
--------
get_senate_vote

Scrapes a single JSON page for a particular Senate vote, given by the vote number

Parameters
----------
vote : int
   The vote number to fetch
   
Returns
-------
vote : dict
   The JSON-decoded dictionary for that vote
   
Examples
--------
>>> get_senate_vote(11)['bill']
{u'congress': 113,
 u'number': 325,
 u'title': u'A bill to ensure the complete and timely payment of the obligations of the United States Government until May 19, 2013, and for other purposes.',
 u'type': u'hr'}
"""
#your code here

def get_senate_vote(vote):
    url = 'http://www.govtrack.us/data/congress/113/votes/2013/s%i/data.json' % vote
    page = requests.get(url).text
    return json.loads(page)

In [3]:
"""
Function
--------
get_all_votes

Scrapes all the Senate votes from http://www.govtrack.us/data/congress/113/votes/2013,
and returns a list of dicts

Parameters
-----------
None

Returns
--------
votes : list of dicts
    List of JSON-parsed dicts for each senate vote
"""
#Your code here

def get_all_votes():
    page = requests.get('https://www.govtrack.us/data/congress/113/votes/2013/').text
    dom = web.Element(page)
    votes = [a.attr['href'] for a in dom.by_tag('a') 
             if a.attr.get('href', '').startswith('s')]
    n_votes = len(votes)
    return [get_senate_vote(i) for i in range(1, n_votes + 1)]

In [4]:
#vote_data = get_all_votes()
vote_data = json.load(open('vote_data.json'))

Problem 2

Now, turn these data into a NetworkX graph, according to the spec below. For details on using NetworkX, consult the lab materials for November 1, as well as the NetworkX documentation.


In [5]:
"""
Function
--------
vote_graph

Parameters
----------
data : list of dicts
    The vote database returned from get_vote_data

Returns
-------
graph : NetworkX Graph object, with the following properties
    1. Each node in the graph is labeled using the `display_name` of a Senator (e.g., 'Lee (R-UT)')
    2. Each node has a `color` attribute set to 'r' for Republicans, 
       'b' for Democrats, and 'k' for Independent/other parties.
    3. The edges between two nodes are weighted by the number of 
       times two senators have cast the same Yea or Nay vote
    4. Each edge also has a `difference` attribute, which is set to `1 / weight`.

Examples
--------
>>> graph = vote_graph(vote_data)
>>> graph.node['Lee (R-UT)']
{'color': 'r'}  # attributes for this senator
>>> len(graph['Lee (R-UT)']) # connections to other senators
101
>>> graph['Lee (R-UT)']['Baldwin (D-WI)']  # edge relationship between Lee and Baldwin
{'difference': 0.02, 'weight': 50}
"""
#Your code here

def _color(s):
    if '(R' in s:
        return 'r'
    if '(D' in s:
        return 'b'
    return 'k'
    
def vote_graph(data):
    
    senators = set(x['display_name'] for d in data for vote_grp in d['votes'].values() for x in vote_grp)
    weights = {s: {ss: 0 for ss in senators if ss != s} for s in senators}
    
    for d in data:
        for grp in ['Yea', 'Nay']:
            if grp not in d['votes']:
                continue
            vote_grp = d['votes'][grp]
            for i in range(len(vote_grp)):
                for j in range(i + 1, len(vote_grp)):
                    sen1 = vote_grp[i]['display_name']
                    sen2 = vote_grp[j]['display_name']                
                    weights[min(sen1, sen2)][max(sen1, sen2)] += 1
                    
    g = nx.Graph()
    for s in senators:
        g.add_node(s)
        g.node[s]['color'] = _color(s)
    
    for s1, neighbors in weights.items():
        for s2, weight in neighbors.items():
            if weight == 0:
                continue
            g.add_edge(s1, s2, weight= weight, difference = 1. / weight)
            
    return g

In [6]:
votes = vote_graph(vote_data)

How (and how not) to visualize networks

Network plots often look impressive, but creating sensible network plots is tricky. From Ben Fry, the author of the Processing program:

Usually a graph layout isn’t the best option for data sets larger than a few dozen nodes. You’re most likely to wind up with enormous spider webs or balls of string, and the mess seen so far is more often the case than not. Graphs can be a powerful way to represent relationships between data, but they are also a very abstract concept, which means that they run the danger of meaning something only to the creator of the graph. Often, simply showing the structure of the data says very little about what it actually means, even though it’s a perfectly accurate means of representing the data. Everything looks like a graph, but almost nothing should ever be drawn as one.

Let's look at bad and better ways of visualizing the senate vote network.

First, consider the "default" plot from networkx.


In [7]:
#this makes sure draw_spring results are the same at each call
np.random.seed(1)  

color = [votes.node[senator]['color'] for senator in votes.nodes()]

#determine position of each node using a spring layout
pos = nx.spring_layout(votes, iterations=200)

#plot the edges
nx.draw_networkx_edges(votes, pos, alpha = .05)

#plot the nodes
nx.draw_networkx_nodes(votes, pos, node_color=color)

#draw the labels
lbls = nx.draw_networkx_labels(votes, pos, alpha=5, font_size=8)

#coordinate information is meaningless here, so let's remove it
plt.xticks([])
plt.yticks([])
remove_border(left=False, bottom=False)


The spring layout tries to group nodes with large edge-weights near to each other. In this context, that means it tries to organize the Senate into similarly-voting cliques. However, there's simply too much going on in this plot -- we should simplify the representation.

Problem 3

Compute the Minimum Spanning Tree of this graph, using the difference edge attribute as the weight to minimize. A Minimum Spanning Tree is the subset of edges which trace at least one path through all nodes ("spanning"), with minimum total edge weight. You can think of it as a simplification of a network.

Plot this new network, making modifications as necessary to prevent the graph from becoming too busy.


In [68]:
#Your code here
plt.figure(figsize=(15, 10))
np.random.seed(5)
mst = nx.minimum_spanning_tree(votes, weight='difference')
pos = nx.spring_layout(mst, iterations=900, k=.008, weight='difference')

mst_edges = list(nx.minimum_spanning_edges(votes, weight='difference'))

nl = votes.nodes()
c = [votes.node[n]['color'] for n in nl]
nx.draw_networkx_edges(votes, pos, edgelist=mst_edges, alpha=.2)
nx.draw_networkx_nodes(votes, pos, nodelist = nl, node_color = c, node_size=60)

for p in pos.values():
    p[1] += .02
    
nx.draw_networkx_labels(votes, pos, font_color='k', font_size=7)

plt.title("MST of Vote Disagreement", fontsize=18)
plt.xticks([])
plt.yticks([])
remove_border(left=False, bottom=False)


Problem 4

While this graph has less information, the remaining information is easier to digest. What does the Minimum Spanning Tree mean in this context? How does this graph relate to partisanship in the Senate? Which nodes in this graph are the most and least bi-partisan?

The edges of a minimum spanning tree trace a path of low resistance through the network. In the present context, this has the effect of moving bipartisan Senators like Hagan towards the center of the graph -- it is much easier to connect Hagan to a Republican node than, say, a partisan Democrat like Al Franken. Partisan Senators are pushed away from the center of the graph and deeper into the party cliques.

This scheme also moves outlier senators to the outside of the graph. For example, John Kerry cast very few votes before becoming Secretary of State. Most of the edges connected to John Kerry have large difference values, so the fewest possible number of edges (1) remain in the MST.

Problem 5

(For this problem, use the full graph for centrality computation, and not the Minimum Spanning Tree)

Networkx can easily compute centrality measurements.

Briefly discuss what closeness_centrality means, both mathematically and in the context of the present graph -- how does the centrality relate to partisanship? Choose a way to visualize the closeness_centrality score for each member of the Senate, using edge difference as the distance measurement. Determine the 5 Senators with the highest and lowest centralities.

Comment on your results. In particular, note the outliers John Kerry (who recently resigned his Senate seat when he became Secretary of State), Mo Cowan (Kerry's interim replacement) and Ed Markey (Kerry's permanent replacement) have low centrality scores -- why?

The closeness centrality measures the average difference between a Senator and all other Senators. Bipartisan voters will have more shared votes with the members of the opposite party, which tends to increase their centrality. However, these senators also vote less often with their own party, which can decrease centrality

Centrality scores are also small for people who haven't cast many votes (like John Kerry, Mo Cowan, and Ed Markey). This says nothing about bipartisanship


In [111]:
#Your code here

bet = nx.closeness_centrality(votes, distance='difference')
bipartisans = sorted(bet, key=lambda x: -bet[x])

print "Highest closeness"
for senator in bipartisans[:5]:
    print "%20.20s\t%0.3f" % (senator, bet[senator])
print
print "Lowest closeness"
for senator in bipartisans[-5:]:
    print "%20.20s\t%0.3f" % (senator, bet[senator])
    

plt.figure(figsize=(15, 4))
x = np.arange(len(nl))
y = np.array([bet[n] for n in nl])
c = np.array([votes.node[n]['color'] for n in nl])

ind = np.argsort(y)
y = y[ind]
c = c[ind]

plt.bar(x, y, color=c, align='center', width=.8)

remove_border(left=None, bottom=None)
ticks = plt.xticks(x, [nl[i] for i in x[ind]], 
                   rotation='vertical', fontsize=7)
limits = plt.xlim(-1, x[-1] + 1)


Highest closeness
      Collins (R-ME)	98.685
      Manchin (D-WV)	95.837
        Pryor (D-AR)	95.008
     Donnelly (D-IN)	93.965
        Hagan (D-NC)	93.940

Lowest closeness
       Markey (D-MA)	42.925
       Chiesa (R-NJ)	39.446
   Lautenberg (D-NJ)	25.260
       Booker (D-NJ)	11.866
        Kerry (D-MA)	3.945

Problem 6

Centrality isn't a perfect proxy for bipartisanship, since it gauges how centralized a node is to the network as a whole, and not how similar a Democrat node is to the Republican sub-network (and vice versa).

Can you come up with another measure that better captures bipartisanship than closeness centrality? Develop your own metric -- how does it differ from the closeness centrality? Use visualizations to support your points.


In [10]:
#your code here

"""
Here, we compute the mean weight for the edges that connect a Senator
to a node in the other party (we consider Independents to be Democrats
for this analysis).

This only considers how similarly a Senator votes with the other party.

The scatter plot shows that the betweenness centrality and bipartisan score
correlate with each other. However, the betweenness centrality judges Democrats
to be more bipartisan as a whole. Part of this is a bias due to the fact
that Democrats are the majority party in the Senate right now, so their
votes are considered more "central" due to their bigger numbers.
"""
def bipartisan_score(graph, node):
    party = graph.node[node]['color']
    other = 'r' if party != 'r' else 'b'
    return np.mean([v['weight'] for k, v in graph[node].items() if graph.node[k]['color'] == other])

bp_score = {node: bipartisan_score(votes, node) for node in votes.nodes()}
bp2 = sorted(bp_score, key=lambda x: -1 * bp_score[x])

print "Most Bipartisan"
for senator in bp2[:5]:
    print "%20.20s\t%0.3f" % (senator, bp_score[senator])

print
print "Least Bipartisan"
for senator in bp2[-5:]:
    print "%20.20s\t%0.3f" % (senator, bp_score[senator])

    
senators = bp_score.keys()
x = [bet[s] for s in senators]
y = [bp_score[s] for s in senators]
c = [votes.node[s]['color'] for s in senators]

plt.scatter(x, y, 80, color=c, 
            alpha=.5, edgecolor='white')
plt.xlabel("Betweenness Centrality")
plt.ylabel("Bipartisan Score")
remove_border()


Most Bipartisan
      Collins (R-ME)	153.089
    Murkowski (R-AK)	132.875
      Manchin (D-WV)	118.500
        Pryor (D-AR)	112.913
       McCain (R-AZ)	112.393

Least Bipartisan
       Chiesa (R-NJ)	39.038
       Markey (D-MA)	23.435
   Lautenberg (D-NJ)	9.289
       Booker (D-NJ)	3.659
        Kerry (D-MA)	2.235

Leadership in the Senate

There are many metrics to quantify the leadership in the Senate.

  1. Senate leaders sponsor and co-sponsor lots of bills
  2. Leaders sit on many committees, as well as more important committees
  3. Leaders usually have been in office for a long time

Another approach uses the philosophy behind how Google ranks search results. The core idea behind Google's PageRank algorithm is:

  1. A "good" website (i.e. one to rank highly in search results) is linked to by many other websites
  2. A link found on a "good" website is more important than a link found on a "bad" website

The PageRank algorithm thus assigns scores to nodes in a graph based on how many neighbors a node has, as well as the score of those neighbors.

This technique can be adapted to rank Senate leadership. Here, nodes correspond to Senators, and edges correspond to a senator co-sponsoring a bill sponsored by another Senator. The weight of each edge from node A to B is the number of times Senator A has co-sponsored a bill whose primary sponsor is Senator B. If you interpret the PageRank scores of such a network to indicate Senate leadership, you are then assuming:

  1. Leaders sponsor more bills
  2. Leaders attract co-sponsorship from other leaders

Problem 7

Govtrack stores information about each Senate bill in the current congress at http://www.govtrack.us/data/congress/113/bills/s/. As in problem 1, write two functions to scrape these data -- the first function downloads a single bill, and the second function calls the first to loop over all bills.


In [11]:
"""
Function
--------
get_senate_bill

Scrape the bill data from a single JSON page, given the bill number

Parameters
-----------
bill : int
   Bill number to fetch
   
Returns
-------
A dict, parsed from the JSON

Examples
--------
>>> bill = get_senate_bill(10)
>>> bill['sponsor']
{u'district': None,
 u'name': u'Reid, Harry',
 u'state': u'NV',
 u'thomas_id': u'00952',
 u'title': u'Sen',
 u'type': u'person'}
>>> bill['short_title']
u'Agriculture Reform, Food, and Jobs Act of 2013'
"""
#your code here
def get_senate_bill(bill):
    url = 'http://www.govtrack.us/data/congress/113/bills/s/s%i/data.json' % bill
    page = requests.get(url).text
    return json.loads(page)

In [12]:
"""
Function
--------
get_all_bills

Scrape all Senate bills at http://www.govtrack.us/data/congress/113/bills/s

Parameters
----------
None

Returns
-------
A list of dicts, one for each bill
"""
#your code here
def get_all_bills():
    page = requests.get('http://www.govtrack.us/data/congress/113/bills/s/').text
    dom = web.Element(page)
    links = [a.attr['href'] for a in dom.by_tag('a') 
             if a.attr.get('href', '').startswith('s')]
    return [get_senate_bill(i) for i in range(1, len(links) + 1)]

In [13]:
#bill_list = get_all_bills()
bill_list = json.load(open('bill_list.json'))

Problem 8

Write a function to builded a Directed Graph (DiGraph) from these data, according to the following spec:


In [14]:
"""
Function
--------
bill_graph

Turn the bill graph data into a NetworkX Digraph

Parameters
----------
data : list of dicts
    The data returned from get_all_bills

Returns
-------
graph : A NetworkX DiGraph, with the following properties
    * Each node is a senator. For a label, use the 'name' field 
      from the 'sponsor' and 'cosponsors' dict items
    * Each edge from A to B is assigned a weight equal to how many 
      bills are sponsored by B and co-sponsored by A
"""
#Your code here

def bill_graph(data):
    
    sp = nx.DiGraph()

    for bill in data:
        sponsor = bill['sponsor']['name']
        sponsor_data = bill['sponsor']
    
        cosponsors = [cs['name'] for cs in bill['cosponsors']]
    
        if sponsor not in sp:
            sp.add_node(sponsor, **sponsor_data)
        
        for cosponsor in bill['cosponsors']:
            if cosponsor['name'] not in sp:
                sp.add_node(cosponsor['name'], **cosponsor)            
            cosponsor = cosponsor['name']
        
            try:
                w = sp[cosponsor][sponsor]['weight'] + 1
            except KeyError:
                w = + 1
            sp.add_edge(cosponsor, sponsor, weight=w)    

    return sp

In [15]:
bills = bill_graph(bill_list)

Problem 9

Using nx.pagerank_numpy, compute the PageRank score for each senator in this graph. Visualize the results. Determine the 5 Senators with the highest PageRank scores. How effective is this approach at identifying leaders? How does the PageRank rating compare to the degree of each node?

Note: you can read about individual Senators by searching for them on the govtrack website.


In [16]:
#Your code here

pagerank = nx.pagerank_numpy(bills)
names = np.array(pagerank.keys())
vals = np.array([pagerank[n] for n in names])

ind = np.argsort(vals)
names = names[ind]
vals = vals[ind]

print "Highest Scores"
for n, v in zip(names, vals)[-5:][::-1]:
    print "%20.20s\t%0.3f" % (n, v)

print
print "Lowest Scores" 
for n, v in zip(names, vals)[:5]:
    print "%20.20s\t%0.3f" % (n, v)


Highest Scores
      Brown, Sherrod	0.030
         Harkin, Tom	0.030
         Reid, Harry	0.027
Lautenberg, Frank R.	0.026
    Menendez, Robert	0.026

Lowest Scores
        Chiesa, Jeff	0.001
      Kerry, John F.	0.001
   Cowan, William M.	0.001
      Sessions, Jeff	0.001
          Scott, Tim	0.001

In [17]:
#Your code here

deg = nx.degree(bills)

plt.scatter([deg[n] for n in bills.nodes()],
            [pagerank[n] for n in bills.nodes()], 80, alpha=.8, 
            color='k', edgecolor='white')

labels = ['Reid, Harry', 'Lautenberg, Frank R.', 'Menendez, Robert', 'Harkin, Tom']
for lbl in labels:
    plt.annotate(lbl, (deg[lbl], pagerank[lbl] + .002), fontsize=10, rotation=10)
    
plt.xlabel("Degree")
plt.ylabel("PageRank")
remove_border()


The PageRank approach does seem to be effective at identifying influential Senators like Tom Harkin and Harry Reid (the Majority Leader). We see in particular that Harry Reid's PageRank score is relatively higher than his degree -- he seems to sponsor fewer bills overall, but those bills appear to be more important. This makes sense, since he is the figurehead of the Democratic party in the Senate, and thus probably focuses on the highest-profile legislation.

Interactive Visualization

Producing a good node link layout is not quite so simple. Nevertheless, we will give it a try.

We will use Gephi for interactive graph visualization. Gephi supports a wide variety of graph file formats, and NetworkX exports to several of them. We'll use the Graph Exchange XML Format (GEXF).


In [18]:
nx.write_gexf(votes, 'votes.gexf')

Problem 10: Analysis with Gephi

Download and install Gephi. See the lab for a brief introduction. Load the exported votes file. Try to produce a layout that clearly separates Democrats from Republicans (hint: filter on edge weight and re-layout once you filtered). Run PageRank and some other statistics and try encoding them with node color and node size. Run the "Modularity" statistic and encode the results in color.

Include a screenshot of your "best" visualization and embed the image here with IPython.display.Image. Make sure to include this image in your submission.

Explain your observations. Is the network visualization very helpful? Try to visualize your LinkedIn network (see the lab) or the one provided in the lab. Which dataset is more suitable for visualization and why is there a difference?

David Chouinard did a very nice job with this -- go look at his page. Notice how it is generally difficult to study the details of very complex networks -- we can see bipartisan structure, but not much else. The LinkedIn graphs are easier to digest and interpret

How to Submit

To submit your homework, create a folder named lastname_firstinitial_hw5 and place this notebook file in the folder. Double check that this file is still called HW5.ipynb, and that it contains your solutions. Also include any Gephi screenshots. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work.


css tweaks in this cell