This notebook compares two journal articles, two ideas, which can be read as a form of meta-research. For now, to test our capabilities and to uncover problems with the package, we compare which disciplines are mentioned in the different classification schemes.

Here we will use three publications:

  1. Consensus map of science by Richard Klavans and Kevin W. Boyack
  2. Biglan's classification of disciplines
  3. Zhang

We will start by analyzing each article. Then, in the second part, we will see what we can get by fusing them together.

Consensus map of science


In [67]:
from disciplines.theory import consensus_map_of_science
from IPython.display import Image
i = Image(url='http://wiki.cns.iu.edu/download/attachments/1245876/worddavecfd04f904a8c7a15eaac3c2b9a6305a.png')
i


Out[67]:

If we look into the article, we find its main findings presented in the form of a table. We entered the data from that table into Python lists and dictionaries. (In the future we will do it in a format that is more pandas friendly.)


In [68]:
discipline_name_dictionary = consensus_map_of_science.name_dict
consensus_table = consensus_map_of_science.TABLE_3
consensus_map_of_science.TABLE_3.keys()


Out[68]:
['table', 'name', 'columns']

In [69]:
consensus_map_of_science.TABLE_3['columns']


Out[69]:
['Rank', 'Pair', 'N', 'N-poss', '%']

In [70]:
consensus_map_of_science.TABLE_3['table'][0:2]


Out[70]:
[[1, 'B', 'BC', 20, 20, 100.0], [2, 'I', 'MD', 20, 20, 100.0]]
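
One possible pandas-friendly layout is a list with one dictionary per row. The sketch below uses only the column labels and the two rows shown in the outputs above. Because the rows carry six values against five labels, it drops the first value of each row, which is an assumption about how the labels line up. The name `records` is also an assumption of this sketch, not part of the `disciplines` package.

```python
# Column labels and rows copied from the outputs above.
columns = ['Rank', 'Pair', 'N', 'N-poss', '%']
table = [[1, 'B', 'BC', 20, 20, 100.0], [2, 'I', 'MD', 20, 20, 100.0]]

# Drop the leading value of each row so that five values line up
# with five labels, then pair every label with its value.
records = [dict(zip(columns, row[1:])) for row in table]

for record in records:
    print(record)
```

A list of this shape can be handed directly to `pd.DataFrame(records)`.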

In [71]:
import pandas as pd
df = pd.DataFrame([x[1:] for x in consensus_table['table']])
df.columns = consensus_table['columns']
df.index = range(1, len(df) +1) # Here we do some reindexing so it stays closer to original form.

In [72]:
df['Rank'] = df['Rank'].apply(lambda x: discipline_name_dictionary[x])
df['Pair'] = df['Pair'].apply(lambda x: discipline_name_dictionary[x])

df['Rank'] = df['Rank'].str.rstrip(' ') 
df['Pair'] = df['Pair'].str.rstrip(' ')

In [73]:
unique_disciplines_in_consensus = set(df['Rank'].unique()) | set(df['Pair'].unique())

Graph object


In [88]:
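import networkx as nx
from disciplines.theory.consensus_map_of_science import TABLE_3
from disciplines.theory.consensus_map_of_science import name_dict
g_cms = nx.Graph()
# With the column labels applied to x[1:], x[4] is 'N-poss' and x[5] is '%'.
# Alternative: keep the raw codes as node names and use the '%' value as weight.
#g_cms.add_edges_from([(x[1], x[2], {'weight':x[5]}) for x in TABLE_3['table']])
# Nodes are full discipline names; the weight is x[4].
g_cms.add_edges_from([(name_dict[x[1]], name_dict[x[2]], {'weight':x[4]}) for x in TABLE_3['table']])
nx.draw(g_cms)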


Here we have a list of disciplines, or formations, mentioned in the consensus map of science module. Can we compare it to the disciplines mentioned elsewhere? What about Biglan's?

Biglan classification of sciences

A separate notebook, Biglan, goes deeper into the subject.


In [74]:
from disciplines.theory import biglan
columns = ['discipline','pure', 'hard', 'life']
the = biglan.the_classification
mylist = []
for line in the:
    for discipline in line[0]:
        mylist.append([discipline, line[1]['pure'], line[1]['hard'], line[1]['life']])
df_biglan = pd.DataFrame(mylist, columns=columns)
df_biglan['discipline'] = df_biglan['discipline'].str.rstrip(' ')
unique_disciplines_in_biglan = set(df_biglan['discipline'].unique())

Currently we have a few small problems: for example, psychology and psychiatry are merged into one entry. Also, brain research has a second name, neuroscience, and earth sciences have a second name, geoscience. We have to handle this complexity somehow.


In [75]:
def see_similarity(terms):
    '''
    We expect terms to need splitting for two reasons: the appearance of "()" and "/".
    '''
    term1, term2 = [term.lower() for term in terms]
    # "a/b" names two alternatives
    parts1 = term1.split('/')
    # "a (b)" gives a second name in parentheses
    if ('(' in term2) and (')' in term2):
        parts2 = [part.strip() for part in term2.rstrip(')').split('(')]
    else:
        parts2 = [term2]

    # similar when any variant of the first term appears among those of the second
    answer = any(part in parts2 for part in parts1)

    return [parts1, parts2], answer
see_similarity(['Psychology/psychiatry', 'Psychology'])


Out[75]:
([['psychology', 'psychiatry'], ['psychology']], True)
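
The second names mentioned above (neuroscience for brain research, geoscience for earth sciences) could be handled with a small alias table. This is a minimal sketch, not part of the `disciplines` package: `ALIASES`, `name_variants` and `share_a_name` are hypothetical names, and which name counts as canonical is an arbitrary choice made here.

```python
import re

# Hypothetical alias table: the pairs come from the text above,
# the choice of which name is canonical is an assumption.
ALIASES = {
    'neuroscience': 'brain research',
    'geoscience': 'earth sciences',
}

def name_variants(label):
    """Return the set of lower-cased names contained in one label.

    Splits on "/" and on parentheses, then maps known aliases
    to their canonical form.
    """
    parts = re.split(r'[/()]', label.lower())
    cleaned = (part.strip() for part in parts)
    return {ALIASES.get(part, part) for part in cleaned if part}

def share_a_name(label1, label2):
    """True when the two labels have at least one name in common."""
    return bool(name_variants(label1) & name_variants(label2))

print(share_a_name('Psychology/psychiatry', 'Psychology'))
print(share_a_name('Brain research', 'Neuroscience'))
print(share_a_name('Earth sciences', 'Chemistry'))
```

Comparing sets of whole names avoids accidental substring hits, at the cost of maintaining the alias table by hand.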

In [12]:
from disciplines.theory import web_of_science_categories
wos = set(web_of_science_categories.categories)

In [90]:
g_biglan = nx.Graph()
# add all the disciplines and record their properties
# if properties match, we add weight:
# one property matches - 1
# two properties match - 2
# three properties match - 3
for x in biglan.the_classification:
    for discipline in x[0]:
        g_biglan.add_node(discipline, **x[1])  # pass the properties as keyword attributes
        
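# Two ways to go:
##    connect nodes to a node that represents the value, or
##    connect nodes that have the same value.
# Here: connecting to a node that represents the value.

# list(...) takes a snapshot, because the loop adds new nodes to the graph
for node, info in list(g_biglan.nodes(data=True)):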
    if info['life'] == True:
        g_biglan.add_edge(node, 'life')
    if info['life'] == False:
        g_biglan.add_edge(node, 'non-life')
    if info['pure'] == True:
        g_biglan.add_edge(node, 'pure')
    if info['pure'] == False:
        g_biglan.add_edge(node, 'applied')
    if info['hard'] == True:
        g_biglan.add_edge(node, 'hard')
    if info['hard'] == False:
        g_biglan.add_edge(node, 'soft')

import matplotlib.pyplot as plt

plt.figure(figsize=(12,12))
nx.draw(g_biglan)
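
The other option named in the comments, connecting disciplines that share values directly, could weight each edge by the number of matching properties (1, 2 or 3). The sketch below works on plain dictionaries. The three disciplines and their values are placeholders for illustration, not taken from `biglan.the_classification`, and `shared_property_edges` is a hypothetical helper.

```python
from itertools import combinations

# Placeholder data: names and values are illustrative only.
properties = {
    'discipline A': {'pure': True, 'hard': True, 'life': False},
    'discipline B': {'pure': True, 'hard': False, 'life': False},
    'discipline C': {'pure': False, 'hard': False, 'life': True},
}

def shared_property_edges(properties):
    """Build (node, node, {'weight': n}) tuples, n = number of equal properties."""
    edges = []
    for (name1, props1), (name2, props2) in combinations(sorted(properties.items()), 2):
        weight = sum(props1[key] == props2[key] for key in props1)
        if weight > 0:  # pairs with nothing in common get no edge
            edges.append((name1, name2, {'weight': weight}))
    return edges

edges = shared_property_edges(properties)
print(edges)
```

The resulting list has the shape that `Graph.add_edges_from` accepts.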


Comparison

Now that we have the datasets, we can compare them. By comparison we mean a few things:

  1. Do the sets contain the same disciplines?
  2. Do the sets contain similar disciplines?
  3. Do both sets have a similar structure?
  4. How similar are the sets in size?

In [76]:
overlaping = unique_disciplines_in_biglan & unique_disciplines_in_consensus
a_but_not_b  = unique_disciplines_in_biglan - unique_disciplines_in_consensus
b_but_not_a  = unique_disciplines_in_consensus - unique_disciplines_in_biglan
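
One way to condense the first and last questions into a single number is the Jaccard index: the size of the intersection divided by the size of the union. The notebook itself does not use this measure; it is shown here as one possible option, with toy sets standing in for the real sets of discipline names.

```python
def jaccard(set1, set2):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / float(len(union))

# Toy sets; the real inputs would be the two sets of discipline names.
a = {'Physics', 'Chemistry', 'Psychology'}
b = {'Physics', 'Chemistry', 'Biology', 'Geoscience'}

print(len(a), len(b))
print(jaccard(a, b))
```

A value of 1.0 means the two sets are identical, and 0.0 means they share nothing.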

In [77]:
def compare_two_sets_of_disciplines(set1, set2):
    """Compare two sets of discipline names.

    Args:
        set1 (set): discipline names from the first source.
        set2 (set): discipline names from the second source.

    Returns:
        answer (dict): has three keys: 'same' (names found in both sets),
            'similar' (pairs where one name contains the other) and
            'different' (left empty by this version).
    """
    answer = {'same': set([]), 
              'similar': set([]), 
              'different': set([])}
    for discipline1 in set1:
        for discipline2 in set2:
            if discipline1 == discipline2:
                answer['same'].add(discipline1)
            elif (discipline1 in discipline2) or (discipline2 in discipline1):
                answer['similar'].add((discipline1, discipline2))
    return answer

In [78]:
# Change to permutations
compared = []
compared.append(compare_two_sets_of_disciplines(unique_disciplines_in_biglan, unique_disciplines_in_consensus))

In [79]:
import networkx as nx
g = nx.Graph()
g.add_nodes_from(compared[0]['same'])
g.add_edges_from(compared[0]['similar'])
%matplotlib inline
nx.draw(g)

Now we have two ways of extending the code. First, we can see what kind of

Let's connect the two!

Now we will merge the two graphs, the two categorizations. They come from very different schools.

  • put a big weight between nodes if they have the same name,
  • let them share some weight towards all the nodes connected to their pair if the nodes have the same name,
  • put a node between them if the nodes have a similar name.

But before we put our ideas to work, let's see what happens if we just compose the two graphs.

In [32]:
plt.figure(figsize=(10,10))
first_composition = nx.compose(g, g_biglan)
nx.draw(first_composition)



In [33]:
g_composed = nx.Graph()
new_nbunch = []
new_ebunch = []

set1 = [node.rstrip(' ') for node in g.nodes()]
set2 = list(g_biglan.nodes())
#all_nodes = g_composed.nodes()

def find_similar(set1, set2):
    answer = []
    for node in set1:
        for node1 in set2:
            if ((node in node1) or (node1 in node)) and (node != node1):
                answer.append((node, node1, {'weight':100}))
    return answer

temporary_result = find_similar(set1, set2)  
first_composition.add_edges_from(temporary_result)
plt.figure(figsize=(10,10))
nx.draw(first_composition)
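
Two of the merging ideas listed earlier (a heavy edge for identical names, an intermediate node for similar names) could be sketched as follows. To keep identically named nodes from collapsing into one, each node is tagged with its source. The function `linking_edges`, the tags `'a'`, `'b'` and `'bridge'`, and the weights are all assumptions of this sketch.

```python
def linking_edges(names_a, names_b, same_weight=100):
    """Edges between two classifications, with nodes tagged as (source, name).

    same name    -> one heavily weighted edge
    similar name -> two edges through an intermediate 'bridge' node
    """
    edges = []
    for a in sorted(names_a):
        for b in sorted(names_b):
            node_a, node_b = ('a', a), ('b', b)
            if a == b:
                edges.append((node_a, node_b, {'weight': same_weight}))
            elif a in b or b in a:
                bridge = ('bridge', a, b)
                edges.append((node_a, bridge, {'weight': 1}))
                edges.append((bridge, node_b, {'weight': 1}))
    return edges

edges = linking_edges({'Physics', 'Psychology'},
                      {'Physics', 'Psychology/psychiatry'})
for edge in edges:
    print(edge)
```

Tuples are hashable, so they can serve as node names when the edges are added to a graph.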


Next, add list of humanities and social sciences


In [17]:
from disciplines.theory import Zhang
zhang = Zhang.columns

`zhang` is a list of dictionaries, each with a string as key and a list of disciplines as value. Problems: first, each key is a set of distinct disciplines. Second, some disciplines contained in a list are actually


In [18]:
import networkx as nx
g = nx.Graph()

little_bunch = []
for column in zhang:
    # each dictionary maps a heading to its list of subdisciplines
    for heading, subdisciplines in column.items():
        for subdiscipline in subdisciplines:
            little_bunch.append((subdiscipline, heading, {'source':'Zhang'}))
        #print(heading.split(','))  # if we will want to split

g.add_edges_from(little_bunch)
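
If keys that name several disciplines at once are to be split on commas, the edge list could be built as in the sketch below. The sample data is hypothetical and only imitates the list-of-dictionaries shape. It is not taken from `Zhang.columns`, and `edges_with_split_keys` is an illustrative helper.

```python
# Hypothetical sample: a list of one-key dictionaries, where a key
# may name several disciplines separated by commas.
sample = [
    {'History, Archaeology': ['Medieval history', 'Classical archaeology']},
    {'Linguistics': ['Phonetics']},
]

def edges_with_split_keys(columns, source):
    """Return one edge per (subdiscipline, single discipline) pair."""
    edges = []
    for column in columns:
        for key, subdisciplines in column.items():
            for discipline in key.split(','):
                for subdiscipline in subdisciplines:
                    edges.append((subdiscipline, discipline.strip(), {'source': source}))
    return edges

edges = edges_with_split_keys(sample, 'Zhang')
print(len(edges))
```

Splitting links every subdiscipline to each discipline named in the key, which may overconnect the graph when a subdiscipline belongs to only one of them.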

In [47]:
def create_relations_based_on_similar_named(graph):
    """Takes a nx.Graph object and creates relations based on similar names.
    
    Notes:
        This is complicated because some names are separated with "," and similar
        separators.
        Another way to do a similar thing is to retrieve keywords such as engineering,
        humanities, multidisciplinary, psychology, etc."""

    answer = []
    # use the graph passed in rather than the global g
    for node in graph.nodes():
        for node1 in graph.nodes():
            if ',' in node1:
                for separated_node in node1.split(','):
                    if (separated_node in node) and (node1, node) not in answer:
                        answer.append((node1, node))
    return answer
ebunch = create_relations_based_on_similar_named(g)

In [53]:
[len(x) for x in ebunch]


Out[53]:
[2, 2, 2, ..., 2]  # 744 entries, all equal to 2

In [57]:
import networkx as nx
g_zhang_2 = nx.Graph()
g_zhang_2.add_edges_from(ebunch)

In [65]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
nx.draw(g_zhang_2)


Add some more? Web of science


In [66]:
compared.append(compare_two_sets_of_disciplines(unique_disciplines_in_consensus, wos))
compared.append(compare_two_sets_of_disciplines(unique_disciplines_in_biglan, wos))