Daina Bouquin
Below is an analysis of affiliations between authors and journals in a 2-mode network built from the NASA Astrophysics Data System (ADS). This project builds on work performed in Project 2. The primary objective is to use clustering techniques (e.g., the island method) to find small sub-networks of important authors who frequently collaborate. In doing so, we can also see which journals stand out as focal points for these collaborations.
In [50]:
import networkx as nx
import os
import ads
import matplotlib.pyplot as plt
import pandas as pd
from networkx.algorithms import bipartite as bi
In [51]:
os.environ["ADS_DEV_KEY"] = "kNUoTurJ5TXV9hsw9KQN1k8wH4U0D7Oy0CJoOvyw"
In [52]:
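# read the token from the environment variable rather than passing its name as a literal string
ads.config.token = os.environ["ADS_DEV_KEY"]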
In [59]:
# Search for papers (50 most cited) on stars (very general search)
papers1 = list(ads.SearchQuery(q="stars", sort="citation_count", max_pages=1))
In [60]:
# find author names (one list of authors per paper)
a = []
for i in papers1:
    authors1 = i.author
    a.append(authors1)
author_names = a
In [62]:
# find the journals (one journal per paper)
j = []
for i in papers1:
    journals1 = i.pub
    j.append(journals1)
journals = j
In [63]:
# create an initial df: one row per paper
df = pd.DataFrame({'Author_Names': author_names,
                   'Journal': journals})
In [64]:
# Expand the df to one row per (journal, author) pair by stacking the author lists
s1 = df.apply(lambda x: pd.Series(x['Author_Names']),axis=1).stack().reset_index(level=1, drop=True)
s1.name = 'Author_Name'
df_m = df.drop('Author_Names', axis=1).join(s1)
df_m.head()
Out[64]:
In [65]:
author_nodes = pd.DataFrame(df_m.Author_Name.unique(),columns=['Author_Name'])
author_nodes['node_type'] = 'Author_Name'
journal_nodes = pd.DataFrame(df_m.Journal.unique(), columns=['Journal'])
journal_nodes['node_type'] = 'Journal'
In [66]:
# Build the graph from the node sets and edges
# set bipartite attribute to ensure weighted projection will work
a_nodes = list(author_nodes['Author_Name'])
j_nodes = list(journal_nodes['Journal'])
edge_bunch = [tuple(i) for i in df_m.values]
g = nx.Graph()
g.add_nodes_from(a_nodes,node_type='Author_Name', bipartite=0)
g.add_nodes_from(j_nodes,node_type='Journal', bipartite=1)
g.add_edges_from(edge_bunch)
In [67]:
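# Weighted Projections/Clustering
# find the large connected components, using 200 nodes as the cut-off
# sort the components by size so that the first one is the largest
components = sorted(nx.connected_components(g), key=len, reverse=True)
big_subg = [g.subgraph(c).copy() for c in components if len(c) > 200]
# Largest:
sg_largest = big_subg[0] # largest connected subgraph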
In [68]:
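# weighted projections can be applied to this subgraph once it is split into its two node sets
# use the bipartite attribute set above (authors: bipartite=0, journals: bipartite=1),
# since bi.sets() does not guarantee which set is returned first
Author_Names = {n for n, d in sg_largest.nodes(data=True) if d.get('bipartite') == 0}
Journals = set(sg_largest) - Author_Names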
In [70]:
j_proj_sg_largest = bi.weighted_projected_graph(sg_largest, Journals)
In [72]:
a_proj_sg_largest = bi.weighted_projected_graph(sg_largest, Author_Names)
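To make the projection weights concrete, the sketch below reproduces by hand what a shared-neighbor weighted projection computes: two authors are linked with a weight equal to the number of journals they both appear in. The toy journal and author labels are made up for illustration, and `project_by_shared_neighbors` is a hypothetical helper written for this example, not a networkx function.

```python
from collections import defaultdict
from itertools import combinations

def project_by_shared_neighbors(edges):
    """Project (journal, author) pairs onto the author side.

    Returns {(author_a, author_b): number of journals they share},
    with each pair stored in sorted order.
    """
    # group the authors that appear in each journal
    authors_by_journal = defaultdict(set)
    for journal, author in edges:
        authors_by_journal[journal].add(author)

    # every pair of authors in the same journal gains one unit of weight
    weights = defaultdict(int)
    for authors in authors_by_journal.values():
        for pair in combinations(sorted(authors), 2):
            weights[pair] += 1
    return dict(weights)

# Toy data (hypothetical labels), in the same (journal, author) order as edge_bunch
toy_edges = [
    ("Journal X", "Author A"), ("Journal X", "Author B"), ("Journal X", "Author C"),
    ("Journal Y", "Author A"), ("Journal Y", "Author B"),
]
toy_weights = project_by_shared_neighbors(toy_edges)
print(toy_weights)
```

In this toy case only the Author A and Author B pair shares two journals, so it is the only edge that would survive a weight threshold of 1.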
In [74]:
# Use the Island Method
j = j_proj_sg_largest.edges(data=True)
a = a_proj_sg_largest.edges(data=True)
In [77]:
# Count the edges in each projection whose weight is greater than 1
print(len([i for i in a if i[2]['weight'] > 1]))
print(len([i for i in j if i[2]['weight'] > 1]))
In [79]:
# Keep only edges with weight greater than 1 to find the nodes with strong relationships within the sub-graphs.
# tidy() is similar to the function presented in Social Network Analysis (SNAS), Chapter 4.
def tidy(g, weight):
    g_temp = nx.Graph()
    # keep only the edges whose weight exceeds the threshold (edge data is preserved)
    edge_bunch2 = [i for i in g.edges(data=True) if i[2]['weight'] > weight]
    g_temp.add_edges_from(edge_bunch2)
    return g_temp
In [81]:
a_sg_island = tidy(a_proj_sg_largest, 1)
j_sg_island = tidy(j_proj_sg_largest,1)
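As a rough illustration of the island idea, the sketch below raises the "water level" on a small weighted edge list and reports the groups of nodes that remain connected. It is a pure-Python sketch under assumed toy data; `islands` is a hypothetical helper for this example and is not the notebook's `tidy` function or a networkx call.

```python
from collections import defaultdict, deque

def islands(weighted_edges, threshold):
    """Keep edges with weight > threshold, then return the connected
    components of what remains as a list of sorted node lists."""
    adjacency = defaultdict(set)
    for u, v, w in weighted_edges:
        if w > threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # breadth-first search from each unvisited node
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

# Toy weighted edges (hypothetical): a ring of five nodes with mixed weights
toy = [("A", "B", 3), ("B", "C", 1), ("C", "D", 2), ("D", "E", 2), ("E", "A", 1)]
toy_islands = islands(toy, 1)
print(toy_islands)
```

With a threshold of 1 the ring breaks at its two weakest edges, leaving two islands; raising the threshold to 2 would leave only the A and B pair.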
We now have island subgraphs for the projected author and journal networks. Examining degree centrality will help reveal which nodes are key to each network.
In [102]:
# degree centrality of both island clusters
a_degree = nx.degree_centrality(a_sg_island)
j_degree = nx.degree_centrality(j_sg_island)
pd.DataFrame.from_dict(a_degree,orient='index').sort_values(0,ascending=False).head()
Out[102]:
In [103]:
pd.DataFrame.from_dict(j_degree,orient='index').sort_values(0,ascending=False).head()
Out[103]:
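For reference, degree centrality is a node's degree divided by the largest degree it could have, n - 1, where n is the number of nodes. The sketch below computes that ratio by hand on a made-up edge list; `simple_degree_centrality` is a hypothetical helper for illustration and assumes an undirected graph without self-loops.

```python
def simple_degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    n = len(neighbors)
    if n <= 1:
        # guard against division by zero on degenerate input
        return {node: 1.0 for node in neighbors}
    return {node: len(nbrs) / float(n - 1) for node, nbrs in neighbors.items()}

# Toy edges (hypothetical): a hub tied to three nodes, plus one extra tie
toy_edges = [("Hub", "A"), ("Hub", "B"), ("Hub", "C"), ("A", "B")]
toy_centrality = simple_degree_centrality(toy_edges)
for node in sorted(toy_centrality, key=toy_centrality.get, reverse=True):
    print(node, round(toy_centrality[node], 3))
```

A value of 1.0 means the node is tied to every other node in its graph, which is why the score is most informative when compared within a single island graph rather than across graphs of different sizes.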
Now that the islands are isolated, we can split them into their connected subgraphs (keeping those with more than one node) and draw some basic plots.
In [88]:
# examine the connected subgraphs of each island graph
j_connected = [j_sg_island.subgraph(c).copy() for c in nx.connected_components(j_sg_island) if len(c) > 1]
a_connected = [a_sg_island.subgraph(c).copy() for c in nx.connected_components(a_sg_island) if len(c) > 1]
In [92]:
# combining the graphs
def merge_graph(connected_g):
    g = nx.Graph()
    for h in connected_g:
        g = nx.compose(g, h)
    return g

a_islands = merge_graph(a_connected)
j_islands = merge_graph(j_connected)
In [96]:
nx.draw(a_islands)
In [100]:
# compute the circular layout first so that it is actually used when drawing
pos = nx.circular_layout(j_islands)
nx.draw(j_islands, pos=pos)