This notebook analyses a dataset of arXiv papers for our paper studying Deep Learning as a general purpose technology (GPT).
Activities
In [143]:
%matplotlib inline
#Imports
#Key imports are loaded from my profile (see standard_imports.py in the src folder).
import matplotlib.patches as mpatches
import random
#Paths
top = os.path.dirname(os.getcwd())
#External data (to download the GRID database)
ext_data = os.path.join(top,'data/external')
#Interim data (to place seed etc)
int_data = os.path.join(top,'data/interim')
#Processed data
proc_data = os.path.join(top,'data/processed')
#Figures
fig_path = os.path.join(top,'reports/figures')
#Models
mod_path = os.path.join(top,'models')
#Get date for saving files
today = datetime.datetime.today()
today_str = "_".join([str(x) for x in [today.day,today.month,today.year]])
In [144]:
# Basic functions
#Flatten list
def flatten_list(a_list):
'''
Utility function that takes a list of nested elements and flattens it
'''
return([x for el in a_list for x in el])
def flatten_freqs(a_list):
'''
Utility function that flattens a list and returns element frequencies
'''
#This flattens the list and returns value counts
return(pd.Series(flatten_list(a_list)).value_counts())
#Functions
def create_lq_df(df,year=None):
'''
Takes a df where each cell is the level of activity for a column (category) in a row (area) and returns a df where each cell is the corresponding location quotient (LQ)
'''
area_activity = df.sum(axis=0)
area_shares = area_activity/area_activity.sum()
lqs = df.apply(lambda x: (x/x.sum())/area_shares, axis=1)
if year!=None:
lqs['period'] = year
return(lqs)
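To make the LQ logic concrete, here is a quick illustrative check of create_lq_df on made-up numbers (not part of the analysis): an LQ above 1 means the area is relatively specialised in that category.
#Illustrative check of create_lq_df on toy data (invented numbers)
toy = pd.DataFrame({'dl':[30,5],'not_dl':[70,95]},index=['area_a','area_b'])
#area_a devotes 30% of its papers to DL vs 17.5% overall, so its DL LQ is above 1
print(create_lq_df(toy))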
We have two datasets:
For now we will focus on number one. We are thinking of number two as more of a robustness check.
In [145]:
# Load the papers
papers = pd.read_json(
ext_data+'/corex_matched_noOAG.json',orient='records')
papers.shape
Out[145]:
There are 247,931 papers. This is the total number of CS + Stats papers in arXiv (see the methodology section of the paper for a description of the process used to obtain them).
In [146]:
#Note that there are many duplicated papers. This seems to be linked to the way they have been geocoded
len(set(papers.arxiv_id))
Out[146]:
In [147]:
#The only difference between papers I can find is in the latitude and longitude.
#TODO: Check with Joel what's going on here.
#for column in papers.columns:
# print(column)
# print(papers.loc[0,column])
# print(papers.loc[1,column])
#
# print('\n')
In [148]:
#Check the data
papers.head()
Out[148]:
We have a big bunch of columns with topics (TOPIC_...) and metadata from arXiv. Let's check what the latter are.
In [149]:
#Check the col
for column in papers.columns:
if 'TOPIC' not in column:
print(column)
The date field is a timestamp (in milliseconds), so we parse it and extract the year.
In [150]:
papers['date_parsed'] = [datetime.datetime.fromtimestamp(x/1000) for x in papers['date']]
papers['year'] = papers['date_parsed'].apply(lambda x: x.year)
In [151]:
#How many categories are there by paper?
papers.arxiv_categories.apply(lambda x: len(x)).describe()
Out[151]:
In [152]:
flatten_freqs(papers.arxiv_categories).head(n=10)
Out[152]:
There is a strong presence of mathematics papers here, even though we are getting the CS papers.
In [153]:
papers.language.value_counts().head(n=10).plot.bar(color='blue')
Out[153]:
Not all papers are in English? What does this mean?
In [154]:
papers.loc[papers['language']=='en@@@ja','full_title'].head()
Out[154]:
They are in English
We want to make the data easier to work with. We will create a df with the topics, indexed by arxiv_id, and another with the other variables we are interested in.
Before doing that we need to deduplicate the papers. The analysis we are doing right now focuses on the diffusion of ML in other topics, for which we don't need individual institution information.
In [155]:
#papers_un means papers unique
papers_un = papers.drop_duplicates('arxiv_id')
papers_un.shape
Out[155]:
In [156]:
#Create a topic df only including those variables referring to a topic
topics = papers_un.set_index('arxiv_id')[[x for x in papers_un.columns if 'TOPIC' in x]]
topics.head()
Out[156]:
In [157]:
#These are the topics. The neural network topics are 13 and 36. They also seem to pick up more generic machine learning content.
for num,x in enumerate(topics.columns):
print(str(num)+' '+x)
The DL topics are 13 and 36
In [158]:
topics.columns[[13,36]]
Out[158]:
Question to consider
In [159]:
#And now we create a paper metadata df.
# Note that we have to set the index after subsetting (otherwise the subsetting can't find the arxiv id in the columns!)
papers_meta = papers_un[[x for x in papers_un.columns if 'TOPIC' not in x]].set_index('arxiv_id')
papers_meta.head()
Out[159]:
In [160]:
#Select columns of interest
my_columns = ['arxiv_categories','arxiv_raw_summary','arxiv_title', 'citations','year','full_title','journal']
#These columns pick up the description of the papers, their categories, citations, the journal and the year.
#I need all these things for the analysis of 'diffusion' which is coming up.
papers_meta = papers_meta[my_columns]
papers_meta.head()
Out[160]:
Our first stage is a descriptive analysis of DL activity: in order to do this, we need to combine the paper data and the topic mix data and then label papers based on the relative importance of the DL topics. We will then plot some descriptives.
We will start with a class that classifies papers depending on the presence of DL topics. Since we have two topics and it is not straightforward to combine coefficients into a single 'DL weight', we will classify the papers twice and then combine all the information to generate a DL variable.
In [161]:
class DlPaperClassification():
'''
The class takes a paper df, a topic mix and an index for the topics which contain DL.
It has a .label_papers method that takes the topic mix and categorises papers into DL groups.
It also generates a categorical variable indicating whether a paper is 'specialist' (DL is its top topic) or
'embedded' (DL is present but not the top topic)
'''
def __init__(self,papers,topic_mix,dl_var):
'''
Initialise the class with a papers file,
A topic mix file and a list of DL categories.
'''
        #Store the inputs
self.papers = papers
self.topics = topic_mix
#This can be more than one
self.dl_var = dl_var
def label_papers(self,thres=0.2):
'''
We label papers into different levels of DL activity based on the weight
in their topic mix
-present if it is above a certain threshold
-top if it is the top topic (not necessarily above 0.5)
'''
#Load all the information we need for the analysis
papers = self.papers
topics = self.topics
dl_var = self.dl_var
#Classify papers into categories
#Is the DL topic present?
dl_present = pd.Series(topics[dl_var].apply(lambda x: x>thres),
name='dl_present')
#Is the DL topic the biggest one?
dl_max = pd.Series(topics.idxmax(axis=1)==dl_var,name='dl_spec')
#Concatenate all categories and index them (to concatenate with the papers in a moment)
dl_all_class = pd.concat([dl_present,dl_max],axis=1)
#We define an 'embed' category if papers have dl presence but are not specialised
dl_all_class['dl_embed'] = (dl_all_class['dl_present']==True) & (dl_all_class['dl_spec']==False)
dl_all_class.index = topics.index
#Concatenate papers and our topic classification
papers_lab = pd.concat([papers,dl_all_class],axis=1)
#And give them a categorical variable depending on whether they are specialist or embedded
papers_lab['dl_category'] = ['dl_spec' if x==True else 'dl_embed' if y==True else 'not_dl' for
x,y in zip(papers_lab['dl_spec'],papers_lab['dl_embed'])]
#Save outputs
#Labels stores the labels we have created
self.labels = dl_all_class
#Papers_lab stores the paper metadata labelled
self.papers_lab = papers_lab
#topics_agg stores the aggregated topics (mostly for checking)
#self.topics_agg = topic_aggregated
return(self)
In [162]:
#Run the analysis for both classes
dl_vars = [
'TOPIC_learning_neural_neural network_training_machine learning_classification_trained_machine_learn_learning algorithm',
'TOPIC_state art_art_state_deep_convolutional_convolutional neural_convolutional neural network_deep learning_datasets_deep neural']
#Each of the elements in dl_classified is the output of the classification for one DL topic
dl_classified = [DlPaperClassification(papers_meta,topics,var).label_papers().labels for var in dl_vars]
In [163]:
#These are the totals for both categories
pd.concat([dl_classified[0].sum(),dl_classified[1].sum()],axis=1)
Out[163]:
In [164]:
#We create two lists of dl papers: one that appears in either topic (expansive) and one that appears in both (restrictive)
#Expansive (DL is present in either topic)
papers_expansive = dl_classified[0].loc[(dl_classified[0]['dl_present']==True) | (dl_classified[1]['dl_present']==True)].index
#Restrictive (DL is present in both topics)
papers_restrictive = dl_classified[0].loc[(dl_classified[0]['dl_present']==True) & (dl_classified[1]['dl_present']==True)].index
print(len(papers_expansive))
print(len(papers_restrictive))
In [165]:
#Percentage of DL papers in the total
len(papers_expansive)/len(papers_meta)
Out[165]:
In [166]:
# Now we want to explore those papers
def sense_checker(data,text_var,sample_size=10,text_size=300):
'''
This function takes a dataset, draws random samples from it and prints the text so we can sense-check the quality of the matches.
'''
#Draw a random sample of size sample size from the data
drawn = random.sample(list(data.index),sample_size)
#return(data.loc[drawn])
#For each element we have drawn from the sample we print the text variable up to the parameter length
for obs in data.loc[drawn][text_var]:
print(obs[:text_size])
print('\n')
In [167]:
print('Expansive')
print('=========')
sense_checker(papers_meta.loc[papers_expansive],text_var='arxiv_raw_summary')
print('Restrictive')
print('=========')
sense_checker(papers_meta.loc[papers_restrictive],text_var='arxiv_raw_summary')
The results for both analyses look fine. We will stick with the expansive definition for now (it gives us more data).
In [168]:
papers_meta.citations.quantile(0.9)
Out[168]:
In [169]:
#Functions used in the class
def get_cited_papers(data,citation_var,q=0.75):
'''
This function subsets a dataset returning the most cited papers of the period (based on the citation variable and the quantile)
'''
#Get the quantile
citation_quantile = papers[citation_var].quantile(q)
#Subset the data
subset_data = data.loc[data[citation_var]>=citation_quantile]
return(subset_data)
In [256]:
class DlPaperAnalysis_GPT():
'''
    This class generates descriptive analyses informing our first research question: is DL a GPT?
It does so with three methods:
.is_growing produces a timeseries comparing levels of activity in DL papers versus the total
.is_spreading estimates the diffusion of DL papers in different fields
.is_impactful estimates the citation rates for papers in different fields
'''
def __init__(self,papers,dl_ids):
        '''
This function is initialised with the full set of papers and the ids of DL papers
'''
#We label the data with the ids
papers['is_dl'] = ['dl' if x in dl_ids else 'not_dl' for x in papers.index]
#Store the information
self.papers = papers
#Also store the DL ids although I doubt we will do much with them
self.dl_ids = dl_ids
#Extract categories (we are only interested in computer science or statistics / ML)
categories = [x for x in set(flatten_list(papers.arxiv_categories)) if (x[:2]=='cs') | (x=='stat.ML')]
self.categories=categories
def is_growing(self,ax,year_lims=(2000,2018),thres_year=2012,high_cited=False):
'''
This method charts levels of activity in DL and compares the importance of DL before / after a threshold year
We also give it:
-year_lims to subset the x axis
-thres_year to compare the importance of DL before/after the threshold year
-high_cited subsets the data to focus on the most highly cited papers each year (its value represents the
position in the distribution)
'''
#Load papers
papers = self.papers
#Subset if we are focusing on highly cited papers
if high_cited!=False:
#This loops over years and extracts the top cited papers
papers = pd.concat([get_cited_papers(papers.loc[papers.year==x,:],'citations',high_cited) for x in np.arange(year_lims[0],year_lims[1])])
#######################
#1. Create a timeseries
#######################
#Create timeseries (note we are subsetting this table with the year lims)
papers_year = pd.crosstab(papers['year'],papers['is_dl']).loc[year_lims[0]:year_lims[1]]
#Plot
papers_year.plot.bar(stacked=True,ax=ax)
#Add titles etc
if high_cited==False:
title = 'Number of papers in ArXiv (DL / non DL), \n {y0}-{y1}'.format(y0=str(year_lims[0]),y1=str(year_lims[1]))
else:
title = 'Number of papers in ArXiv (DL / non DL), \n {y0}-{y1} (top {q} citations in year)'.format(y0=str(year_lims[0]),y1=str(year_lims[1]),
q=str(100*high_cited)+'%')
ax.set_title(title,size=14)
#Store information
self.papers_year = papers_year
#############################
#2. Before / after comparison
###############################
#Crosstabs a boolean indicating if the year is before / after the threshold and normalise over the rows
ct = pd.crosstab(papers['year']>thres_year,papers['is_dl'],normalize=0)
#We want to relabel the index of the crosstab to make the output more readable
y = str(thres_year)
ct.index=['Before '+y, 'After '+y]
self.dl_shares_change= ct
def is_spreading(self,ax,year_lims=(2000,2017),thres_year=2012,high_cited=False,pop_categories=False):
'''
This method charts the diffusion of DL across domains.
One annoying aspect of this is that the papers have multiple categories with no weights.
We will expand the data and consider categories separately.
pop_categories allows us to focus on the more popular categories of activity where we expect our share estimates to be more robust.
#What are the key outputs:
#Line chart representing DL papers as a share of total in papers with different categories
#Line chart comparing DL papers as a share of total in different categories before / after threshold.
#Note that the ax argument has two elements for the two figures we are drawing.
'''
#Load papers
papers = self.papers
#Subset if we have decided to focus on highly cited papers
if high_cited!=False:
#This loops over years and extracts the top cited papers (should probably turn this into a function)
papers = pd.concat([get_cited_papers(papers.loc[papers.year==x,:],'citations',high_cited) for x in np.arange(year_lims[0],year_lims[1])])
        #Default to all CS / stat.ML categories
        categories = self.categories
        #If we are filtering to focus on popular categories
        if pop_categories!=False:
            #This extracts the top categories based on their frequency of appearance in the data
            categories = flatten_freqs(papers.arxiv_categories)[self.categories][:pop_categories].index
#######
#1. Create linechart of activity by category
########
#We create a couple of containers to store the data
#Share container stores the share of DL in total (we will use this for plotting)
cat_share_container =[]
#Cat total container stores the totals for each category. We use a dict for this
cat_total_container = {}
#We loop over each category of interest
for cat in categories:
#Subset the data to identify papers with the category
subset = papers.loc[[cat in x for x in papers['arxiv_categories']],:]
#We crosstab year vs dl categorical
subset_year = pd.crosstab(subset['year'],subset['is_dl'])
#Store the totals
cat_total_container[cat] = subset_year
#If there are any DL papers at all
if 'dl' in subset_year.columns:
#Calculate the share of DL papers
subset_year['share'] = subset_year['dl']/subset_year.sum(axis=1)
                #We only output the share as a series named after the category (this will become the column name when we concatenate later)
out = pd.Series(subset_year['share'],name=cat)
#Out it comes
cat_share_container.append(out)
#Create the df filling nas and focusing on our threshold years
category_share_df = pd.concat(cat_share_container,axis=1).fillna(0).loc[year_lims[0]:year_lims[1]]
#Now we plot this.
#Note that we are assuming that there are too many variables for a legend. We will probably create a cleaner version with nicer labels later.
category_share_df.rolling(window=3).mean().plot(legend=False,color='mediumblue',alpha=0.7,ax=ax[0])
ax[0].set_title('DL paper shares by ArXiv categories',size=14)
ax[0].set_ylabel('Share of all papers in category /year')
#Store results
self.cat_totals = cat_total_container
self.cat_shares = cat_share_container
self.cat_shares_df = category_share_df
#########
#2. Create barchart comparing two intervals
#########
cat_period_container = []
#As before, we loop over categories.
for cat in categories:
#Subset the data to identify papers with the category
subset = papers.loc[[cat in x for x in papers['arxiv_categories']],:]
#We crosstab a boolean (before / after threshold) vs the dl boolean
subset_ct = pd.crosstab(subset['year']>thres_year,subset['is_dl'],normalize=0)
#This is to relabel the index (useful for the chart later)
y = str(thres_year)
subset_ct.index=['Before '+y, 'After '+y]
#We append to the container, turning into a series so we can rename
cat_period_container.append(pd.Series(subset_ct['dl'],name=cat))
#Create the df
cat_thres_df = pd.concat(cat_period_container,axis=1).T.sort_values('After '+y,ascending=False)
cat_thres_df.plot.bar(ax=ax[1])
ax[1].set_title('Change in DL shares before/after '+str(thres_year),size=14)
ax[1].set_ylabel('Share of all papers in category/year')
#Store the df
self.cat_thres_df = cat_thres_df
def is_impactful(self,ax,q=0.75,year_thres=2012,pop_categories=False):
'''
Finally, we want to check if DL papers are 'impactful' - do they tend to receive more citations than other papers in each field?
        To measure this we estimate, for each category, the share of DL papers among all papers vs their share among highly cited papers.
We focus on papers published from a threshold year to avoid being skewed by changes in the historical distribution of papers.
'''
#Load papers and categories
papers = self.papers
categories = self.categories
cit_cont=[]
#If we are filtering to focus on popular categories
if pop_categories!=False:
#This extracts the top categories based on their frequency of appearance in the data
categories = flatten_freqs(papers.loc[papers.year>year_thres,'arxiv_categories'])[categories][:pop_categories].index
#For each category
for cat in categories:
#Here we have the papers since threshold (eg 2012) in the category
subset = papers.loc[(papers.year>year_thres) & ([cat in x for x in papers['arxiv_categories']])]
#Share of dl in all papers
dl_all = subset['is_dl'].value_counts(normalize=True)['dl']
#Share of dl in highly cited papers
#We use a previous function to subset this
subset_high_cited = get_cited_papers(subset,'citations',q)
dl_high_cited = subset_high_cited['is_dl'].value_counts(normalize=True)['dl']
#out = pd.Series([dl_all,dl_high_cited],index=['dl_share_all','dl_share_high_cited'],name=cat)
#We output an index which normalises the share of high cited papers by the total.
            #It is positive if DL papers are overrepresented among the highly cited ones
out = pd.Series((dl_high_cited/dl_all)-1,index=['high_cited_total_ratio'],name=cat)
cit_cont.append(out)
#Create citation df
citation_df = pd.concat(cit_cont,axis=1).T
#And plot it
citation_df.sort_values('high_cited_total_ratio',ascending=False).plot.bar(ax=ax,legend=False)
#Add title
        ax.set_title('DL paper citation \'competitiveness\' \n (papers published after {y}, top {q} citations in period)'.format(
            y=str(year_thres),q=str(100*q)+'%'))
        #And y label
ax.set_ylabel('(DL papers share of highly cited/ \n DL papers share of all)-1')
#Store the df
self.citation_impact_df = citation_df
In [257]:
test = DlPaperAnalysis_GPT(papers_meta,papers_expansive)
fig,ax = plt.subplots(figsize=(5,3))
test.is_growing(ax=ax,year_lims=(2005,2018))
ax.set_title('')
ax.set_ylabel('Number of papers')
ax.set_xlabel('')
ax.legend(title='Category',labels=['Deep Learning','Not Deep Learning'])
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_1_trends.pdf')
There is a fast increase of activity in arXiv.
DL appears to be growing at a faster rate, consistent with the 'rapidity' thesis.
In [280]:
test.dl_shares_change
Out[280]:
In [259]:
test2 = DlPaperAnalysis_GPT(papers_meta,papers_expansive)
fig,ax = plt.subplots(figsize=(7,3))
test2.is_growing(high_cited=0.75,ax=ax)
test2.dl_shares_change
Out[259]:
In [260]:
fig,ax = plt.subplots(nrows=2,figsize=(7,6))
test.is_spreading([ax[0],ax[1]],pop_categories=35,year_lims=(2005,2018))
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_2_shares.pdf')
DL is becoming more important in multiple disciplines. This includes disciplines that specialise in the development of AI technologies (e.g. cs.NE = neural networks, or cs.AI = AI) but also others such as Computer Vision, Computation and Language, Information Retrieval or Graphics.
TODO What's the discipline with the 'bump' around 2014?
In [261]:
fig,ax = plt.subplots(figsize=(7,4))
test.is_impactful(ax,year_thres=2012,pop_categories=35,q=0.75)
ax.set_ylabel('Citation competitiveness index \n $CC_i$')
ax.set_title('')
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_3_impact.pdf')
DL papers are overrepresented in the set of influential papers for most CS disciplines with only a few exceptions (software engineering and performance)
Note that some of the most popular topics for DL (in the chart above) are quite low in these rankings because DL papers already represent the majority of papers in them.
Having studied the development and diffusion of DL, we want to analyse its geography. What are our hypotheses here?
Our hypothesis is that there has been a disruption in the geography of DL: a change in the relative specialisations of countries.
How do we analyse this?
As before, we will write a class to do this.
NB see here for a quick tutorial on spatial joins
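Since that tutorial link is external, here is a minimal, self-contained sketch of the point-in-polygon spatial join we are about to run (toy geometries and invented names, illustrative only):
#Toy point-in-polygon spatial join with geopandas (illustrative only)
import geopandas as gp
from shapely.geometry import Point, Polygon
regions = gp.GeoDataFrame({'region_name':['unit_square']},
                          geometry=[Polygon([(0,0),(1,0),(1,1),(0,1)])])
points = gp.GeoDataFrame({'paper_id':['p1','p2']},
                         geometry=[Point(0.5,0.5),Point(2,2)])
points.crs = regions.crs
#'within' keeps only the points that fall inside a polygon (p1 here)
print(gp.sjoin(points,regions,op='within'))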
In [176]:
#Alas, we don't have countries in these data.
import geopandas as gp
from shapely.geometry import Point
In [177]:
#Read the shapefile
admin_shape = gp.read_file(ext_data+'/admin_shapefile/ne_10m_admin_1_states_provinces.shp')
admin_shape['country_reg'] = [str(x)+'_'+str(y) for x,y in zip(admin_shape.iso_a2,admin_shape.name_en)]
In [178]:
#We will use a spatial join. To do this we need to create a geopandas df with the spatial coordinates
#for each paper. We will create an individual paper id for each paper-institution pair so it's straightforward to
#merge things later
papers['paper_id'] = ['id_'+str(num) for num in np.arange(0,len(papers))]
#We create a geo papers df with the lat lon
geo_paper = papers.set_index('paper_id')[['grid_lat','grid_lon']]
In [179]:
#Some of the papers here have multiple lat lons - they are from institutions with multiple locations.
#We will drop them for now.
geo_paper = geo_paper.loc[[len(x)==1 for x in geo_paper['grid_lat']]]
#Also drop papers whose latitude values are None
geo_paper = geo_paper.loc[[x[0]!=None for x in geo_paper['grid_lat']]]
geo_paper = geo_paper.dropna(subset=['grid_lat'])
In [180]:
len(geo_paper)-len(papers)
Out[180]:
We lose around 24,000 observations. TODO: check with Joel what to do with these.
In [181]:
#Now we turn the lat and lon into coordinates
paper_points = geo_paper.apply(lambda x: Point([x['grid_lon'][0],x['grid_lat'][0]]),axis=1)
#And create the geodataframe
papers_geodf = gp.GeoDataFrame(geo_paper,geometry=paper_points)
In [182]:
#Make sure we have the same coordinate reference system (CRS)
papers_geodf.crs= admin_shape.crs
In [183]:
#And do the spatial join - the operation indicates that we are doing a point in polygon.
papers_geographies = gp.sjoin(papers_geodf,admin_shape,op='within')
In [184]:
#Focus on the variables we are interested in (country and region)
papers_geo_short = pd.DataFrame(papers_geographies[['admin','name_en','country_reg']])
In [185]:
#Merge with the papers df
papers_all= papers.set_index('paper_id').join(papers_geo_short,how='left')
In [186]:
#Create the papers df for spatial analysis
#Variables of interest
my_vars = ['arxiv_id','title','arxiv_raw_summary','arxiv_categories',
'journal','citations','institutes',
'grid_lat','grid_lon','admin','name_en','country_reg','year']
papers_spat = papers_all[my_vars].dropna(subset=['name_en'])
#Remove all observations with empty geocodes
papers_spat = papers_spat.loc[[len(x)>0 for x in papers_spat['admin']]]
papers_spat['grid_lat'],papers_spat['grid_lon'] = [[x[0] for x in papers_spat[variable]] for variable in ['grid_lat','grid_lon']]
In [187]:
papers_spat.rename(columns={'name_en':'region','admin':'country'},inplace=True)
In [188]:
len(papers_spat)-len(papers)
Out[188]:
We have lost a few more (2k) observations that had missing country information
Now we write a class that will address our spatial questions:
In [189]:
def get_high_cited_year(data,q,year_lims):
'''
This function extracts high cited papers by year (to control for citation times).
TODO - put this function in all classes above
'''
#This loops over the years and extracts papers in the top quantile of activity.
out = pd.concat([get_cited_papers(data.loc[data.year==x,:],'citations',
q) for x in np.arange(year_lims[0],year_lims[1])])
return(out)
def calculate_herfindahl(series):
'''
    This function takes a series and returns its Herfindahl index (the sum of the squared shares of each observation in the total)
'''
herf = np.sum([(x/np.sum(series))**2 for x in series])
return(herf)
def sort_shares_for_concentration_plot(df,cols):
'''
This function takes a df with shares of activity by area and returns a df ready for plotting to analyse concentration
focusing on the columns of interest
'''
totals_sorted = pd.concat([
df.sort_values(col,ascending=False)[col].reset_index(drop=True) for col in cols],axis=1)
shares_sorted = totals_sorted.apply(lambda x: x/x.sum()).cumsum()
return(shares_sorted)
def concentration_analysis(papers_df,level):
'''
    This function takes a papers df and a level of analysis and performs a concentration analysis. It returns
    a Herfindahl index measuring concentration at that level of activity, and a df with cumulative shares
    of activity to visualise in a plot.
'''
#Calculate totals by category (DL and not DL)
totals_by_cat = pd.pivot_table(papers_df.groupby([level,'is_dl']).size().reset_index(),
index=level,columns='is_dl',values=0).fillna(0)
#And categories for the totals
totals_by_cat['total'] = totals_by_cat.sum(axis=1)
#Calculate Herfindahl with a function we defined before. We are only interested in DL and the total benchmark
herf = totals_by_cat.apply(lambda x: calculate_herfindahl(x))[['dl','total']]
    #To visualise concentration we create a df with sorted cumulative shares
shares_sorted = sort_shares_for_concentration_plot(totals_by_cat,['dl','total'])
return([herf,shares_sorted])
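As a quick sanity check of the Herfindahl logic on made-up numbers: an even split across areas gives a low index, while full concentration in one area gives 1.
#Toy check of calculate_herfindahl (illustrative only)
even = pd.Series([25,25,25,25])           #four areas with equal activity
concentrated = pd.Series([100,0,0,0])     #all activity in one area
print(calculate_herfindahl(even))         #0.25 (=1/4)
print(calculate_herfindahl(concentrated)) #1.0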
In [190]:
class DlPaperAnalysis_Spatial():
'''
This class implements the following methods:
.shares compares dl shares with wider shares (a way of visualising LQs). This works with regions and countries
.concentration compares dl geographical concentration with all papers
.concentration_change plots changes of concentration before and after a threshold period
.spec_change compares changes in specialisation before/after a threshold. We could also compare it with a reference field?
.clustering performs the dbscan analysis
'''
def __init__(self,papers,dl_ids):
        '''
This class is initialised with the full set of papers and the ids of DL papers
'''
#We label the data with the ids
papers['is_dl'] = ['dl' if x in dl_ids else 'not_dl' for x in papers.arxiv_id]
#Store the information
self.papers = papers
#Also store the DL ids although I doubt we will do much with them
self.dl_ids = dl_ids
#Extract categories (we are only interested in computer science or statistics / ML)
categories = [x for x in set(flatten_list(papers.arxiv_categories)) if (x[:2]=='cs') | (x=='stat.ML')]
self.categories=categories
def shares(self,ax,unit='country',high_cited=False,top_ranking=10,year_lims=[2007,2018]):
'''
This function plots shares of total papers and share of DL papers by location. As in previous
classes, we can focus it on highly cited and only plot high activity locations.
'''
#Load papers
papers = self.papers
#If we want to focus on high cited, apply the high cited function
if high_cited!=False:
papers = get_high_cited_year(papers,high_cited,year_lims)
#Now we create a df with total shares
total_shares = papers[unit].value_counts(normalize=True)
        #And another with DL shares
dl_shares = pd.crosstab(papers[unit],papers['is_dl'],normalize=1)['dl']
#Concatenate them
all_shares = pd.concat([total_shares,dl_shares],axis=1).sort_values(unit,ascending=True)
#Name columns
all_shares.columns = ['all_papers_share','dl_papers_share']
#store results
self.shares_activity = all_shares
#Plot
all_shares[-top_ranking:].plot.barh(ax=ax)
if high_cited==False:
title = 'Share of total and DL papers by {unit}, \n {y0}-{y1}'.format(y0=str(year_lims[0]),y1=str(year_lims[1]),
unit=unit)
else:
title = 'Share of total and DL papers by {unit}, \n {y0}-{y1} (top {q} citations in year)'.format(
y0=str(year_lims[0]),y1=str(year_lims[1]),q=str(100*high_cited)+'%',unit=unit)
        ax.set_title(title,size=14)
        return(self)
def concentration(self,ax,unit='country',high_cited=False):
'''
This method estimates three things for the selected unit of analysis:
-Herfindahl for the whole interval and before/after the 2012 threshold
-Shares of activity by location in a table and a curve.
-Shares of activity by location (change)
'''
#This is copied and pasted from the above. TODO: refactor
#Load papers
papers = self.papers
#If we want to focus on high cited, apply the high cited function
if high_cited!=False:
papers = get_high_cited_year(papers,high_cited,[np.min(papers.year),np.max(papers.year)])
#We run the previously defined function,
#which turns a df of totals into a df ready to be plotted for a concentration analysis
conc = concentration_analysis(papers,level=unit)
#The concentration_analysis function returns a list with two elements: herfindahl indices and shares_df for plotting
self.herf = conc[0]
shares_sorted = conc[1]
ax.plot(shares_sorted)
ax.legend(['dl','total'])
#And (ugly) title
if high_cited==False:
title = 'Concentration of total and DL papers by {unit}'.format(unit=unit)
else:
title = 'Concentration of total and DL papers by {unit}, \n (top {q} citations in year)'.format(
q=str(100*high_cited)+'%',unit=unit)
ax.set_title(title,size=14)
#Add labels
ax.set_xlabel('Rank')
ax.set_ylabel('Cumulative share of \n activity')
def concentration_change(self,ax,unit='country',high_cited=False,threshold=2012,hline=0.75):
'''
        This is quite similar to the analysis above but splitting the papers into two groups (before and after the threshold)
'''
#Load papers
papers = self.papers
#If we want to focus on high cited, apply the high cited function
if high_cited!=False:
papers = get_high_cited_year(papers,high_cited,[np.min(papers.year),np.max(papers.year)])
#We run the previously defined function,
#which turns a df of totals into a df ready to be plotted for a concentration analysis
#But in this case we run it twice on different subsets of the data
conc_t0 = concentration_analysis(papers.loc[papers['year']<threshold,:],level=unit)
conc_t1 = concentration_analysis(papers.loc[papers['year']>threshold,:],level=unit)
#Store the two sets of concentration indices
self.herf = [conc_t0[0],conc_t1[0]]
self.shares_shorted = [conc_t0[1],conc_t1[1]]
shares_sorted_t0, shares_sorted_t1 = conc_t0[1],conc_t1[1]
ax[0].plot(shares_sorted_t0)
ax[1].plot(shares_sorted_t1)
#Add the legend
ax[0].legend(['dl','total'],loc='lower right')
#Add titles and labels
ax[0].set_title('Before {t}'.format(t=threshold),size=14)
ax[1].set_title('After {t}'.format(t=threshold),size=14)
ax[1].set_xlabel('Rank')
ax[0].set_ylabel('Cumulative share of \n activity')
ax[1].set_ylabel('Cumulative share of \n activity')
        #Add an hline indicating at what rank we go over the hline share of the observations
#Find the right x values
x_0,x_1 = [df[df['dl']>hline].index[0] for df in [shares_sorted_t0,shares_sorted_t1]]
ax[0].hlines(y=hline,xmin=0,xmax=len(shares_sorted_t0),linestyle=':')
ax[0].vlines(x=x_0,ymin=0,ymax=1,linestyle=':')
ax[1].hlines(y=hline,xmin=0,xmax=len(shares_sorted_t1),linestyle=':')
ax[1].vlines(x=x_1,ymin=0,ymax=1,linestyle=':')
def spec_changes(self,ax,unit='country',high_cited=False,top_ranking=10,year_lims=[2007,2018],window=4):
'''
Here we compare the evolution of specialisation (relative overrepresentation of papers by country).
It is a linechart.
        We add a 'window' parameter to the previous arguments, to smooth the lines.
'''
#Load papers
papers = self.papers
#Focus on years of interest
papers = papers.loc[(papers.year > year_lims[0]) & (papers.year < year_lims[1]),:]
#If we want to focus on high cited, apply the high cited function
if high_cited!=False:
papers = get_high_cited_year(papers,high_cited,year_lims)
#Identify top locations
locations = papers[unit].value_counts()[:top_ranking].index
#Now we estimate LQs by year and paper
#This estimates the LQ
papers_year = papers.groupby('year').apply(lambda x: create_lq_df(pd.crosstab(x[unit],
x['is_dl'])))
#Now we pivot the data
spec_wide = pd.pivot_table(papers_year, index='year',columns=unit,values='dl')[locations]
#Store it
self.spec_place = spec_wide
spec_wide.rolling(window=window).mean().plot(ax=ax,linewidth=3)
ax.legend(bbox_to_anchor=(1,1),title=unit)
ax.hlines(y=1,xmin=year_lims[0]+4,xmax=year_lims[1]-1,linestyle=':')
#Add titles as before. Quite convoluted
if high_cited==False:
title = 'Comparative advantage in DL by {unit}, \n {y0}-{y1}, \n {w}-year moving averages'.format(
y0=str(year_lims[0]),y1=str(year_lims[1]),unit=unit,w=window)
else:
title = 'Comparative advantage in DL by {unit}, \n {y0}-{y1} (top {q} citations in year), \n {w}-year moving averages'.format(
y0=str(year_lims[0]),y1=str(year_lims[1]),q=str(100*high_cited)+'%',unit=unit,w=window)
#Set title
ax.set_title(title,size=14)
def spec_thres(self,ax,unit='country',high_cited=False,top_ranking=20,year_threshold=2012):
"""
This creates a barchart comparing comparative advantages before / after a threshold year
"""
#Load papers
papers = self.papers
#Add variable for subsetting
papers['threshold_year'] = ['pre_'+str(year_threshold) if y<year_threshold else 'post_'+str(year_threshold) for
y in papers['year']]
locations = papers[unit].value_counts()[:top_ranking].index
#If we are working with highly cited papers
#Split into two years, apply the get_cited_papers and combine
if high_cited != False:
papers =papers.groupby('threshold_year').apply(lambda x: get_cited_papers(
x,'citations',high_cited)).reset_index(drop=True)
#Now we calculate the LQs for both years.
lqs = papers.groupby(
'threshold_year').apply(lambda x: create_lq_df(pd.crosstab(x[unit],x['is_dl']))).reset_index(drop=False)
#This creates the table for plotting
specs_wide = pd.pivot_table(lqs,
index='threshold_year',columns=unit,values='dl')[locations].T.sort_values(
'post_'+str(year_threshold),ascending=False).plot.bar(ax=ax)
self.spec_wide = lqs
        #And the labels
ax.legend(bbox_to_anchor=(1,1),title='Period')
ax.hlines(y=1,xmin=0,xmax=top_ranking,linestyle=':')
#Add titles as before. Quite convoluted
if high_cited==False:
title = 'Comparative advantage in DL by {unit}, \n before and after {y}'.format(y=str(year_threshold),
unit=unit)
else:
title = 'Comparative advantage in DL by {unit}, \n before and after {y} (top {q} citations in year)'.format(
y=str(year_threshold),unit=unit,q=str(100*high_cited)+'%')
#Set title
ax.set_title(title,size=14)
In [191]:
#Initialise class
test_2 = DlPaperAnalysis_Spatial(papers_spat,papers_expansive)
#Plot concentration
fig,ax =plt.subplots()
test_2.concentration(ax,unit='country',high_cited=False)
There is more concentration in DL than in the CS population overall.
The evolution of concentration is interesting: there has been an increase in concentration at the top (the top locations have gained importance) but also a 'stretching' of the middle (a decrease in concentration lower in the distribution), consistent with the idea of a broadening of activity / increase in volatility.
Now we want to focus on changes in concentration
In [192]:
#Visualise shares of activity by country
fig,ax = plt.subplots(figsize=(7,5))
test_2.shares(ax,unit='country',top_ranking=20,high_cited=False)
Out[192]:
This chart shows that some countries such as the US, China, Canada and Switzerland are overrepresented in DL while others such as France, Germany and Italy are underrepresented.
In [193]:
fig,ax = plt.subplots(figsize=(7,5))
test_2.spec_changes(ax,high_cited=False,year_lims=(2006,2018))
Year on year figures are quite noisy. They suggest quite a lot of volatility in DL activity although we would need to compare to another field in order to establish this for sure. This is to be done.
It would be interesting to understand what happened with Australia and China at the beginning of the period
In [194]:
fig,ax = plt.subplots(figsize=(8,5))
test_2.spec_thres(ax,unit='country',high_cited=0.5,top_ranking=25)
ax.set_xlabel('')
ax.set_title('')
ax.set_ylabel('RCA index')
ax.legend(loc='upper right')
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_4_spec_change.pdf',bbox_inches='tight')
In [195]:
#Plot concentration changes
fig,ax =plt.subplots(nrows=2,figsize=(8,5.5),sharey=True,sharex=True)
test_2.concentration_change(ax,unit='country',high_cited=0.5,threshold=2012,hline=0.5)
#fig.suptitle('Changes in concentration')
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_5_concentration_change.pdf',bbox_inches='tight')
In [196]:
#Compare shares of the top countries before / after 2012 (for the paper)
conc_change_dl = pd.concat([test_2.shares_shorted[0]['dl'],test_2.shares_shorted[1]['dl']],axis=1)
conc_change_dl['conc_change'] = conc_change_dl.iloc[:,1]-conc_change_dl.iloc[:,0]
conc_change_dl.head()
Out[196]:
In [197]:
#Relative change in the Herfindahl index for DL papers and for all papers
print(test_2.herf[1]['dl']/test_2.herf[0]['dl']-1)
print(test_2.herf[1]['total']/test_2.herf[0]['total']-1)
Out[197]:
In [198]:
fig,ax = plt.subplots(figsize=(10,5))
test_2.spec_thres(ax,unit='region',high_cited=0.75,top_ranking=30)
We see some evidence of churn when we compare before / after 2012. Some countries such as China, Hong Kong, Singapore and Canada gain a lot of visibility while others such as Switzerland, Netherlands, Japan and Spain see a relative decline. When we look at the regional picture we see some spectacular changes in some places such as Beijing, Baden-Wurttemberg, New York or Ontario.
The analysis above is quite coarse and does not take into account spatial patterns of concentration. We will explore that question using the DBSCAN algorithm, which identifies clusters in a data-driven way by looking for high-density groups of observations within a set radius. We are interested in quantifying geographical disruption: how does the geography of activity change between periods?
Note that given the big changes in DL activity it is quite difficult to compare before/after clusters (the minimum cluster sizes and spans are likely to change). For the same reason, it is hard to compare DL with non-DL papers.
What we will do is compare clustering trends for a 'high DL' and a 'low DL' category group, identified based on the relative importance of DL papers in them.
We will compare changes in activity between these two groups: do the clusters in high DL activity categories display more volatility than those in low DL activity categories?
Implementation
Create a class DlPaperAnalysisCluster which estimates the clusters with a set number of parameters and generates those statistics. We will then do a grid search over different parameters and compare the results (a sketch of this grid search appears after the cluster run below).
We initialise the class with all the data.
-We implement a .segment method to automatically identify, inside the top X arXiv categories, those that are 'high DL' and those that are 'low DL'.
-We implement a method called .time_cluster_comp which compares the clusters between two periods.
-We implement a method called .disc_cluster_comp which compares clusters across disciplines.
All the results are stored so that we can map the results.
In [199]:
#Some imports and changes
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull
from shapely.geometry import Polygon
pd.options.mode.chained_assignment = None
In [200]:
#A bunch of functions we will need to use
def dbscan(data,coords,epsilon,min_samples):
'''
    This function runs DBSCAN over the paper coordinates.
    coords is a list with the names of the lon and lat columns; epsilon is the search radius in km.
'''
#Create the matrix for estimation
#coords = data.as_matrix(columns=[coords[0],coords[1]])
coords = data[[coords[0],coords[1]]].values
#Parameters
kms_per_radian = 6371.0088
    #Convert epsilon from km to radians (we use the haversine metric to measure distances on a sphere)
    epsilon = epsilon / kms_per_radian
#Initialise cluster
db = DBSCAN(eps=epsilon, min_samples=min_samples,
algorithm='ball_tree',
metric='haversine').fit(np.radians(coords))
return(db)
def cluster_report(data,cluster_fit):
'''
This function returns a report for the clusters
'''
#Output container
output=[]
    #The number of clusters is the number of unique labels minus 1 (removing the noise label)
number_clusters = len(set(cluster_fit.labels_))-1
#And append to outputs
output.append(number_clusters)
#Label the data with the cluster labels
data['cluster'] = cluster_fit.labels_
#Remove the observations outside of clusters
data_in_cluster = data.loc[data['cluster']!=-1,:]
#Generate the point coordinates in clusters for mapping
coords_in_cluster = data_in_cluster[['grid_lon','grid_lat']]
#And append to output
output.append(coords_in_cluster)
#Generate the convex hull for each cluster
#We need to get the points for each cluster
data_in_cluster['points'] = [(c1,c2) for c1,c2 in zip(data_in_cluster['grid_lon'],
data_in_cluster['grid_lat'])]
#Also create a geopoints df so we can do the point in polygon thing later
geo_points = gp.GeoDataFrame(Point(x,y) for x,y in zip(data_in_cluster['grid_lon'],
data_in_cluster['grid_lat']))
#print(geo_points)
geo_points.set_geometry(0,inplace=True)
output.append(geo_points)
ch_store = []
#Then we group over the cluster (note that we have already excluded the no-cluster set)
for c in set(data_in_cluster['cluster']):
#We turn each of the coordinates in the cluster into an array of points
points = np.array(data_in_cluster.loc[data_in_cluster['cluster']==c,['grid_lon','grid_lat']])
#We turn those points into polygons and put them in a geoseries with the convex hull (envelope)
#This way we can look for the clusters
ch = gp.GeoSeries(Polygon(points)).convex_hull
ch_store.append(ch)
#Output this as a geodf
geo_df = gp.GeoDataFrame(ch_store)
geo_df.set_geometry(0,inplace=True)
output.append(geo_df)
    #Now I want the locations (country_reg) in each cluster - we focus on the top 5
cities_in_clusters = data_in_cluster.groupby('cluster')['country_reg'].apply(lambda x: x.value_counts()[:5])
#And append to output
output.append(cities_in_clusters)
#Now I want the % of activity in the clusters
activity_in_cluster_share = len(data_in_cluster)/len(data)
#And append to output
output.append(activity_in_cluster_share)
#And the % of highly cited papers in clusters
high_cited_in_cluster_share = len(data_in_cluster.loc[data_in_cluster['high_cited']==True,:])/len(data.loc[
data['high_cited']==True,:])
output.append(high_cited_in_cluster_share)
return(output)
def cluster_comparison(report_1, report_2):
'''
This function checks how many points from the second cluster set are in the first cluster set and vice versa.
    It is a measure of volatility. If all points in the second cluster set were in the first, then we would have perfect stability.
If the opposite, we have perfect disruption.
'''
#How many points in 2 would have been present in 1?
#geodf with period 1 polys
polys_1 = report_1[3]
#geodf with period 2 points
points_2 = report_2[2]
polys_1.crs=points_2.crs
#Do the join
spatial_join = gp.sjoin(polys_1,points_2,how='inner',op='contains')
    #What share of papers in period 2 clusters fall inside a period 1 cluster?
coverage = len(spatial_join)/len(points_2)
    #Which regions are in period 2 clusters but not in period 1 clusters, and vice versa
#Find the unique regions present in the clusters
reg_1,reg_2 = [set(rep[4].reset_index(drop=False)['level_1']) for rep in [report_1,report_2]]
#Estimate exits and entries
exits = reg_1 - reg_2
entries = reg_2 - reg_1
#Save and return
exits_entries = [exits,entries]
return([coverage,exits_entries])
In [201]:
class DlPaperAnalysisCluster():
'''
    This class estimates geo clusters in a paper database with a set number of parameters and generates cluster statistics.
    We will then do a grid search over different parameters and compare the results.
    We initialise the class with all the data.
    -.segment identifies the comparison sets: papers in high DL arXiv categories and low DL arXiv categories.
    We think of them as treatments and controls.
    -.cluster_periods compares changes in clustering before and after a threshold year for each category group
    -.visualise_differences compares the resulting cluster statistics between the high DL and low DL groups
    All the results are stored so that we can map the results.
'''
def __init__(self,papers):
'''
Initialise with the papers. Note that these papers have already been classified into DL / non-DL
'''
#Store papers
self.papers = papers
#Extract categories (we are only interested in computer science or statistics / ML)
categories = [x for x in set(flatten_list(papers.arxiv_categories)) if (x[:2]=='cs') | (x=='stat.ML')]
cat_freqs = flatten_freqs(papers['arxiv_categories'])[categories]
self.categories=cat_freqs.index[cat_freqs>10000]
def segment(self,pop_categories=20,levels=[0.33,0.1]):
'''
        We want to identify a 'treatment' and a 'control' group. The treatment are arXiv categories with a high level of DL activity.
        The control are arXiv categories with a low level of DL activity.
'''
papers = self.papers
categories = self.categories
cat_store = []
#For each category, calculate the DL paper share
for cat in categories:
#Subset the data to identify papers with the category
subset = papers.loc[[cat in x for x in papers['arxiv_categories']],:]
            #We calculate the share of DL papers in the category
subset_ct = subset['is_dl'].value_counts(normalize=True)
subset_ct.name=cat
#Append
cat_store.append(subset_ct)
        #Identify categories with high and low DL shares
self.cat_groups = [[x.name for x in cat_store if x['dl']>levels[0]],[x.name for x in cat_store if x['dl']<levels[1]]]
def cluster_periods(self,threshold=2012,min_samples=200,epsilon=10,high_cited=False,
citation_threshold=0.75,
is_dl=True):
'''
Inputs:
-This method identifies clusters in the data comparing before and after the threshold date.
-It takes parameters for DBSCAN (minimum number of samples and the distance)
-It focuses on all papers (we can instead set this to high cited) and dl papers
Outputs:
-Cluster labels
-Collection of points for mapping (removing 'noise clusters')
        -Shares of t1 activity in t0 clusters (calculated using the convex hull, which gives us the envelope containing all points)
        and vice versa.
'''
#Initialise papers
papers = self.papers
categories = self.cat_groups
#If we are focusing on highly cited papers
#if high_cited!=False:
# papers = get_high_cited_year(papers,high_cited,[min(papers['year']),max(papers['year'])])
#We want to label highly cited papers
high_cited_ids = get_high_cited_year(papers,citation_threshold,[min(papers['year']),max(papers['year'])]).index
papers['high_cited'] = [True if x in high_cited_ids else False for x in papers.index]
results = {}
#We loop over the two sets of categories
#For each category group (high or low categories)
for cat_group in categories:
#For each category in the group
for cat in cat_group:
#Tracker
print(cat)
#Identify papers in group
subset = papers.loc[[cat in arxiv_cat for arxiv_cat in papers['arxiv_categories']]]
#Split into papers in t0 and papers in t1. Note that this excludes the threshold year as a 'boundary'
subset_0 = subset.loc[subset['year']<threshold]
subset_1 = subset.loc[subset['year']>threshold]
#Fit the clustering algorithm for the two subsets in the data and generate the reports
db_report_0,db_report_1 = [cluster_report(subset,dbscan(subset,coords=['grid_lon','grid_lat'],
epsilon=epsilon,
min_samples=min_samples)) for subset in [subset_0,subset_1]]
#And now we want to compare the clusters before and after.
#How much activity in the second period is captured by clusters identified in the first period
compare_1 = cluster_comparison(db_report_0,db_report_1)
results[cat]=[[db_report_0,db_report_1],compare_1]
#return([db_report_0,db_report_1])
self.cluster_comparisons = results
def visualise_differences(self,ax):
'''
Here we want to visualise some of the differences in the data.
Boxplots that compare:
-Share of activity in period 2 captured by clusters from period 1.
-Share of activity in period 1 captured by clusters from period 2.
-Cluster difference between period 1 and period 2
-Increase in the share of activity captured by clusters in different periods
-Increase in the share of citations captured by clusters in different periods
'''
#Load information
comps = self.cluster_comparisons
high_dl_cats = self.cat_groups[0]
#Compare t0 coverage in t1 between both groups.
discs = pd.Series({k:v[1][0] for k,v in comps.items()})
#Compare change in cluster numbers between period 1 and period 2
n_change = pd.Series({k:v[0][1][0]/v[0][0][0] for k,v in comps.items()})
#Compare change in concentration accounted by top clusters
conc_change = pd.Series({k:v[0][1][-2]/v[0][0][-2] for k,v in comps.items()})
#Compare change in high citations accounted by top clusters
cit_change = pd.Series({k:v[0][1][-1]/v[0][0][-1] for k,v in comps.items()})
#Concatenate in a single df.
cluster_comp_df = pd.concat([discs,n_change,conc_change,cit_change],axis=1)
self.cluster_comp = cluster_comp_df
cluster_comp_df.columns = [
'Initial cluster coverage','Cluster expansion','Concentration change','Citation concentration \n change']
#Label the df
cluster_comp_df['high_dl'] = [x in high_dl_cats for x in cluster_comp_df.index]
#Plot all these in 4 rows
ax[0][0].boxplot([cluster_comp_df.loc[cluster_comp_df['high_dl']==True,
cluster_comp_df.columns[0]],
cluster_comp_df.loc[cluster_comp_df['high_dl']==False,
cluster_comp_df.columns[0]]])
ax[0][0].set_ylabel(cluster_comp_df.columns[0],size=10)
ax[0][0].set_title('PANEL A:\n'+cluster_comp_df.columns[0],size=12)
ax[0][1].boxplot([cluster_comp_df.loc[cluster_comp_df['high_dl']==True,
cluster_comp_df.columns[1]],
cluster_comp_df.loc[cluster_comp_df['high_dl']==False,
cluster_comp_df.columns[1]]])
ax[0][1].set_ylabel(cluster_comp_df.columns[1],size=10)
ax[0][1].set_title('PANEL B:\n'+cluster_comp_df.columns[1],size=12)
ax[1][0].boxplot([cluster_comp_df.loc[cluster_comp_df['high_dl']==True,
cluster_comp_df.columns[2]],
cluster_comp_df.loc[cluster_comp_df['high_dl']==False,
cluster_comp_df.columns[2]]])
ax[1][0].set_ylabel(cluster_comp_df.columns[2],size=10)
ax[1][0].set_title('PANEL C:\n'+cluster_comp_df.columns[2],size=12)
ax[1][1].boxplot([cluster_comp_df.loc[cluster_comp_df['high_dl']==True,
cluster_comp_df.columns[3]],
cluster_comp_df.loc[cluster_comp_df['high_dl']==False,
cluster_comp_df.columns[3]]])
ax[1][1].set_ylabel(cluster_comp_df.columns[3],size=12)
ax[1][1].set_title('PANEL D:\n'+cluster_comp_df.columns[3],size=12)
#for num in np.arange(0,4):
# ax[num].boxplot(
# [cluster_comp_df.loc[cluster_comp_df['high_dl']==True,
# cluster_comp_df.columns[num]],
# cluster_comp_df.loc[cluster_comp_df['high_dl']==False,
# cluster_comp_df.columns[num]]])
#
# ax[num].set_xticklabels(['High DL categories','Low DL categories'])
# ax[num].set_title(cluster_comp_df.columns[num],size=14)
return(self)
In [202]:
#Run the test and segment the categories based on their share of DL papers
papers_clust = test_2.papers
test_cl = DlPaperAnalysisCluster(papers_clust)
test_cl.segment(levels=[0.4,0.1])
In [203]:
#These are the test groups
test_cl.cat_groups
Out[203]:
In [204]:
test_cl.cluster_periods(min_samples=50,epsilon=20)
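The implementation notes above mention a grid search over the clustering parameters. We do not run it here, but a rough sketch of what it could look like with the class above is below (the parameter values are illustrative and each run is slow):
#Sketch of a parameter grid search over the clustering settings (illustrative, not run here)
grid_results = {}
for min_samples in [50,100,200]:
    for epsilon in [10,20]:
        run = DlPaperAnalysisCluster(papers_clust)
        run.segment(levels=[0.4,0.1])
        run.cluster_periods(min_samples=min_samples,epsilon=epsilon)
        #Store the before/after comparison statistics for this parameter combination
        grid_results[(min_samples,epsilon)] = run.cluster_comparisons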
In [205]:
fig,ax = plt.subplots(figsize=(7,7),nrows=2,ncols=2,sharex=True)
test_cl.visualise_differences(ax)
ax[1][0].set_xticklabels(['High DL activity','Low DL activity'],size=12,rotation=45,ha='right')
ax[1][1].set_xticklabels(['High DL activity','Low DL activity'],size=12,rotation=45,ha='right')
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_6_micro_comparison.pdf',bbox_inches='tight')
What does all the above mean? It means that:
We have evidenced the GPT nature of DL and also the disruption it has created in existing research networks. Here we focus on explaining the drivers for change. What predicts if a region is part of one of the new clusters?
Our hypothesis:
In [206]:
#Create target and features
#We will use region indices to keep track of variables
#DF to use ('mv_data' means multivariate data)
mv_data = papers_clust.loc[papers_clust['region']!='',:]
In [207]:
#Load the CrunchBase data (which lives in the 'grant data')
cb_data = pd.read_csv(ext_data+'/csv_export.tar.gz',compression='gzip')
#Drop observations for which we have no role or address
cb_data.dropna(axis=0, subset=['roles','address','uuid','founded_on'],inplace=True)
cb_data['is_comp'] = ['company' in x for x in cb_data.roles]
#Focus on companies
cb_comps = cb_data.loc[cb_data['is_comp']==True,:]
#TODO: turn these point in polygon operations into a function.
#Load the geocoded data
cb_geo_sample = pd.read_csv(ext_data+'/cb_geolocated_full.csv')
#Now we turn the lat and lon into coordinates
cb_points = cb_geo_sample.set_index('uuid')[['lng','lat']].apply(lambda x: Point([x['lng'],x['lat']]),axis=1)
#Create geodataframe
cb_geodf = gp.GeoDataFrame(geometry=cb_points)
#Combine with the shapefile
cb_geodf.crs = admin_shape.crs
#And do the spatial join - the operation indicates that we are doing a point in polygon.
cb_geographies = gp.sjoin(cb_geodf,admin_shape,op='within')
#Put the regions back into the company data
cb_comps_geo = cb_comps.set_index('uuid').join(cb_geographies['country_reg'])
#And rename them with 'region_geo' (there is already a 'region' variable in CrunchBase)
#cb_comps_geo.rename(columns={'name_en':'region_geo'},inplace=True)
In [208]:
#Target container
target = pd.DataFrame(index=set(mv_data['country_reg'])-set(' '))
In [209]:
#Identify DL clusters
#This is an exploratory analysis we will put into another notebook afterwards
#Get DL papers
dl_papers = mv_data.loc[mv_data['is_dl']=='dl',:]
#Get period 1 and period 2
dl_papers_0 = dl_papers.loc[dl_papers['year']<2012,:]
dl_papers_1 = dl_papers.loc[dl_papers['year']>2012,:]
#Identify clusters in both periods
dl_clusters_0, dl_clusters_1 = [cluster_report(paps,
dbscan(paps,
coords=['grid_lon','grid_lat'],epsilon=10,
min_samples=100)) for paps in [dl_papers_0,dl_papers_1]]
In [210]:
#Get cluster frequencies
clust_locs_0,clust_locs_1 = [flatten_freqs([df[4].reset_index(level=0).index]) for df in [dl_clusters_0,dl_clusters_1]]
#One problem here is that we are implicitly assuming that the locations have similar sizes, so that they could host a similar number of clusters.
#That's not the case. TODO: control for region size (a rough sketch follows below).
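A rough sketch of the region-size control flagged in the TODO above, using the admin_shape polygons (note that .area on unprojected lat/lon geometries is in squared degrees, so this is only indicative):
#Sketch: normalise the cluster counts by the (rough) area of each admin region (illustrative only)
region_areas = admin_shape.drop_duplicates('country_reg').set_index('country_reg')['geometry'].area
clusters_per_area = (clust_locs_1/region_areas.reindex(clust_locs_1.index)).dropna()
clusters_per_area.sort_values(ascending=False).head()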
In [211]:
#Create the cluster frequency count
target['y'] = pd.concat([target,clust_locs_1],axis=1,join='inner')
#There are 905 names in the data
target.fillna(0,inplace=True)
In [212]:
#target = pd.concat([target,clust_locs_1],axis=1).fillna(0)
#target.rename(columns={0:'y'},inplace=True)
cluster_locs_df = pd.DataFrame(clust_locs_1.reset_index(drop=False))
cluster_locs_df.rename(columns={'index':'Administrative area (country)',0:'Number of clusters'},inplace=True)
cluster_locs_df['Administrative area (country)']=[x[3:]+' ({coun})'.format(coun=x[:2]) for x in
cluster_locs_df['Administrative area (country)']]
cluster_locs_df.loc[:15].to_latex(fig_path+'/tables/top_clusters.tex',index=False)
In [213]:
pd.Series([x.split('_')[0] for x in clust_locs_1.index]).value_counts().head()
Out[213]:
In [214]:
import scipy.stats
In [215]:
#We focus on the first category for each paper
mv_data['arxiv_first_cat'] = [x[0] for x in mv_data['arxiv_categories']]
#And in the period before 2012
mv_data_pre = mv_data.loc[mv_data.year<2012,:]
#Arxiv totals
arxiv_totals = mv_data_pre.groupby('country_reg').size()
arxiv_totals.name='arxiv_totals'
arxiv_totals.shape
Out[215]:
In [216]:
#Measures of diversity
#Unique number of disciplines
#This creates total number of papers by category
arxiv_cat_totals = pd.pivot_table(mv_data_pre.groupby(['country_reg','arxiv_first_cat']).size().reset_index(drop=False),
index='country_reg',columns='arxiv_first_cat',values=0).fillna(0)
#Now this gives us the total number of disciplines present in a region
arxiv_div_n = arxiv_cat_totals[[x for x in arxiv_cat_totals.columns if x[:2]=='cs']].apply(lambda x: x>0,axis=1).sum(axis=1)
arxiv_div_n.name='arxiv_total_discs'
#And the shannon entropy
arxiv_entropy = arxiv_cat_totals[[x for x in arxiv_cat_totals.columns if x[:2]=='cs']].apply(
lambda x: scipy.stats.entropy(x),axis=1)
arxiv_entropy.name = 'arxiv_entropy'
#Concatenate all arXiv data in a single df
arxiv_pred = pd.concat([arxiv_totals,arxiv_div_n,arxiv_entropy],axis=1)
#Remove empty regions
arxiv_pred = arxiv_pred.loc[arxiv_pred.index!='',:]
#Note that this has fewer observations than the full set of regions because some places
#had no arXiv activity before 2012
arxiv_pred.shape
Out[216]:
In [217]:
arxiv_pred.corr()
Out[217]:
CrunchBase has multiple categories for each company. How do we use this to measure diversity?
The simplest option is to flatten the list of categories and count them, treating each category as a capability 'present' in a location. Another option would be to topic-model the categories and identify a top category for each company; we leave that as a TODO, but include a rough sketch of it below.
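As a rough illustration of the topic-modelling alternative (not used in this notebook), one could treat each company's comma-separated category string as a document and fit a small LDA model. This is only a sketch: it assumes scikit-learn is available, the 20-topic setting is an arbitrary placeholder, and the category strings come from the cb_comps_geo dataframe built above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
#Comma-separated CrunchBase category strings, one per company
cat_strings = cb_comps_geo['category_list'].dropna()
#Tokenise on commas so that multi-word categories stay intact
vec = CountVectorizer(tokenizer=lambda s: s.split(','), lowercase=False, token_pattern=None)
dtm = vec.fit_transform(cat_strings)
#Fit a small topic model and assign each company its most likely topic ('main sector')
lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(dtm)
main_sector = lda.transform(dtm).argmax(axis=1)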
In [218]:
# We are focusing on the total levels of activity and diversity before 2012
cb_comps_geo_pre = cb_comps_geo.loc[[int(x.split('-')[0])<2012 for x in cb_comps_geo.founded_on],:]
In [219]:
#Total CB activity (before 2012)
cb_totals = cb_comps_geo_pre.groupby('country_reg').size()
cb_totals.name = 'cb_totals'
In [220]:
# Diversity of activity
#Remove companies with missing sectors
cb_comps_cat = cb_comps_geo_pre.dropna(axis=0,subset=['category_list'])
#Turn them into a list for each company
cb_comps_cat['sector_list'] = cb_comps_cat['category_list'].apply(lambda x: x.split(','))
#And now we want the totals for each category
cb_cat_totals = pd.pivot_table(
cb_comps_cat.groupby('country_reg')['sector_list'].apply(lambda x: flatten_freqs(list(x))).reset_index(drop=False),
index='country_reg',columns='level_1',values='sector_list').fillna(0)
#Total number of areas present
cb_div_n = cb_cat_totals.apply(lambda x: np.sum(x>0),axis=1)
cb_div_n.name='cb_total_sectors'
#Entropy
cb_entropy = cb_cat_totals.apply(lambda x: scipy.stats.entropy(x),axis=1)
cb_entropy.name='cb_entropy'
In [221]:
#Concatenate everything
cb_pred = pd.concat([cb_totals,cb_div_n,cb_entropy],axis=1,sort=True).loc[target.index]
cb_pred.head().fillna(0)
Out[221]:
In [222]:
cb_pred.corr()
Out[222]:
In [223]:
#These are all the locations that have at least some arxiv activity
cb_pred.shape
Out[223]:
In [224]:
cb_pred.to_csv(proc_data+'/{date_today}_cb_data.csv'.format(date_today=today_str))
In [225]:
#Create the control df
control = pd.DataFrame(index=set(mv_data['country_reg']))
control.shape
Out[225]:
In [226]:
#We want to add a dummy for whether a region is in China or not, and a country variable for clustering errors
#We get the country-region lookup from the admin shapefile we downloaded before
#DF
country_by_region = admin_shape[['iso_a2','country_reg']]
#Create a dict
country_by_region_dict= {x:y for x,y in zip(country_by_region['country_reg'],country_by_region['iso_a2'])}
#European countries
eu_countries = ['AT','BE','BG','CY','CZ','DK','EE','FI','FR','DE','GR','HU','IE','IT',
'LV','LT','LU','MT','NL','PL','PT','RO','SK','SI','ES','SE','GB']
In [227]:
#Add the information
control['country'] = [country_by_region_dict[x] for x in control.index]
control['is_china'] = [int(x=='CN') for x in control['country']]
control['is_europe'] = [int(x in eu_countries) for x in control['country']]
#Add the area
admin_shape['admin_area'] = admin_shape['geometry'].area
#Note: there are a few duplicated region names in the shapefile, so we drop duplicates (see the check below)
control = pd.concat([
control.join(admin_shape.drop_duplicates('name_en').set_index('country_reg')['admin_area']),
clust_locs_0],axis=1).fillna(0)
control.rename(columns={0:'cluster_t0'},inplace=True)
control.shape
Out[227]:
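For reference, a quick diagnostic (a sketch, not needed for the analysis) showing which region names appear more than once in the shapefile:
admin_shape.loc[admin_shape.duplicated('name_en',keep=False),['name_en','iso_a2','country_reg']].sort_values('name_en').head(10)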
In [228]:
#Concatenate all the variables
pred = pd.concat([arxiv_pred,cb_pred],axis=1,join='inner').fillna(0)
#Only consider locations with at least one arXiv paper
pred = pred.loc[pred['arxiv_totals']>0,:]
#Finally, some transformations in the data
#Log the totals
for x in pred.columns:
if 'totals' in x:
pred[x]=np.log(pred[x]+0.01)
#Calculate zscores (removing a small number of infinite values)
pred.replace([np.inf, -np.inf], np.nan,inplace=True)
pred_norm = pred.dropna()
pred_norm = pred_norm.apply(lambda x: scipy.stats.zscore(x),axis=0)
pred_norm.corr()
pred.shape
Out[228]:
In [229]:
import scipy
In [230]:
data = pd.concat([target,pred_norm,control],axis=1,join='inner')
#target.join(pred_norm.join(control,how='inner'),how='outer')
data.fillna(0,inplace=True)
data.shape
Out[230]:
In [231]:
#Correlation between variables
data.corr()
Out[231]:
In [232]:
#This is the correlation table
my_vars = ['y','arxiv_totals','arxiv_entropy','cb_totals','cb_entropy','is_china']
corr_table = data[my_vars].corr().applymap(lambda x: np.round(x,3))
corr_table.to_latex(fig_path+'/tables/table_3_correlation_table.tex')
In [233]:
# Produce a bunch of boxplots comparing the independent variables for y > 1
#Variables to plot
plot_vars = ['arxiv_totals','arxiv_entropy','cb_totals','cb_entropy']
#Lay down the plot
fig,ax = plt.subplots(nrows=(len(plot_vars)),figsize=(4,13),sharex=True)
#For each variable, draw the boxplot
for num,x in enumerate(plot_vars):
ax[num].boxplot([data.loc[data['y']==0,x],data.loc[data['y']>0,x]])
#Label the y axis with the variable name
ax[num].set_ylabel(x,size=12)
#Add labels
ax[num].set_xticklabels(['No cluster','Has cluster'])
plt.tight_layout()
These boxplots are broadly consistent with the key hypotheses of our analysis: locations that acquired a concentration of DL research tended to be larger and more diverse, and to host stronger concentrations of tech companies according to the CrunchBase data.
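To put a number on these visual differences, the 'no cluster' and 'has cluster' groups could be compared with a non-parametric test. A minimal sketch using scipy (not part of the main analysis; the two-sided alternative is set explicitly because scipy defaults have changed across versions):
from scipy.stats import mannwhitneyu
#p-values for the difference in each independent variable between locations with and without DL clusters
pvals = {v: mannwhitneyu(data.loc[data['y']==0,v], data.loc[data['y']>0,v], alternative='two-sided').pvalue for v in plot_vars}
pd.Series(pvals)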
In [234]:
# Remember to:
#Cluster standard errors on country
#Do robust standard errors
import statsmodels.api as sm
data['intercept']=1
data['entropy_interaction'] = data['arxiv_entropy']*data['cb_entropy']
data['research_industry_interaction'] = data['arxiv_totals']*data['cb_totals']
x_vars = ['intercept',
#'arxiv_totals',
'arxiv_entropy',
#'cb_totals',
'cb_entropy',
#'research_industry_interaction',
#'entropy_interaction',
'is_china',
#'is_europe',
#'cluster_t0',
'admin_area']
data.shape
#data_2 = data.loc[data.arxiv_totals>np.median(data.arxiv_totals),:]
Out[234]:
In [235]:
#Test model with robust standard errors
test_model= sm.Poisson(data['y'],data[x_vars]).fit(
cov_type='cluster',cov_kwds={'groups':data['country']})
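For the 'robust standard errors' reminder above, a plain heteroskedasticity-robust fit (instead of country-clustered errors) would look roughly like this. This is a sketch with the same specification; it assumes the HC1 covariance type is supported for Poisson in the installed statsmodels version.
#Same model, heteroskedasticity-robust (HC1) standard errors instead of clustering
test_model_hc = sm.Poisson(data['y'],data[x_vars]).fit(cov_type='HC1')
test_model_hc.summary()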
In [236]:
test_model.summary()
Out[236]:
In [237]:
summ = test_model.summary()
beginningtex = """\\documentclass{report}
\\usepackage{booktabs}
\\begin{document}"""
endtex = "\\end{document}"
f = open(fig_path+'/tables/table_poisson.tex', 'w')
f.write(beginningtex)
f.write(summ.as_latex())
f.write(endtex)
f.close()
In [238]:
# As before, write a class that does all this.
class DlSpatialMulti():
'''
This class is initialised with a df of papers and the independent variables / controls dataset produced above.
Methods:
-.get_dl_results uses DBSCAN to extract DL clusters for given parameters (epsilon, min_samples) and models
cluster development with the data features.
-.get_benchmark_results does the same for clusters in a selected arXiv category (used as a benchmark).
-.compare_results compares the coefficients of the fitted models in a bar chart.
'''
def __init__(self,papers,independent_variables):
'''
Initialise with the papers and independent variables
'''
#Store papers
self.papers = papers
#Store independent variables
self.ind_vars = independent_variables
#Dict to store models and data. We will use the keys to label the information that is stored
self.model_store = {}
self.data_store = {}
def get_dl_results(self,thres=2012,epsilon=10,min_samples=100,
x_vars=['intercept',
#'arxiv_totals',
'arxiv_entropy',
#'cb_totals',
'cb_entropy',
#'entropy_interaction',
#'research_industry_interaction',
'is_china',
#'is_europe',
'admin_area']):
'''
This method extracts and models DL clusters. The inputs are a cut-off threshold for cluster emergence,
the DBSCAN parameters and the variables to use in the prediction
'''
#Load papers
papers = self.papers
#Target container
#The index is the number of regions
target = pd.DataFrame(index=set(self.papers['country_reg'])-set(' '))
#Get DL papers
dl_papers = papers.loc[papers['is_dl']=='dl',:]
#Split into two periods
dl_papers_0 = dl_papers.loc[dl_papers['year']<thres,:]
dl_papers_1 = dl_papers.loc[dl_papers['year']>thres,:]
#Identify clusters in both periods
dl_clusters_0, dl_clusters_1 = [cluster_report(paps,
dbscan(paps,
coords=['grid_lon','grid_lat'],epsilon=epsilon,
min_samples=min_samples)) for paps in [
dl_papers_0,dl_papers_1]]
#Get cluster frequencies in both periods
clust_locs_0,clust_locs_1 = [
flatten_freqs([df[4].reset_index(level=0).index]) for df in [dl_clusters_0,dl_clusters_1]]
#Name the cluster locations in t0 (may be used as a control)
clust_locs_0.name = 'cluster_t0'
#Create the cluster frequency count
target['y'] = pd.concat([target,clust_locs_1],axis=1,sort=True,join='outer')
#Fill the missing values
target.fillna(0,inplace=True)
#Concatenate the cluster results with the features
dl_data = pd.concat([target,self.ind_vars],axis=1,sort=True,join='inner').fillna(0)
#Model using xvars. NB we are using clustered standard errors by country.
model= sm.Poisson(dl_data['y'],
dl_data[x_vars]).fit(cov_type='cluster',cov_kwds={'groups':dl_data['country']},
maxiter=5000, maxfun=5000)
#Store everything
self.data_store['dl'] = dl_data
self.model_store['dl'] = model
return(self)
def get_benchmark_results(self,category,thres=2012,epsilon=10,min_samples=100,
x_vars=['intercept',
#'arxiv_totals',
'arxiv_entropy',
#'cb_totals',
'cb_entropy',
'is_china',
#'is_europe',
#'cluster_t0',
#'entropy_interaction',
#'research_industry_interaction',
'admin_area']):
'''
This method does the same as above but for a selected arXiv category.
'''
#Load papers
papers = self.papers
#Target container
target = pd.DataFrame(index=set(self.papers['country_reg'])-set(' '))
#Subset papers to find the relevant category
subset = papers.loc[[category in arxiv_cat for arxiv_cat in papers['arxiv_categories']]]
#Split into papers in t0 and papers in t1. Note that this excludes the threshold year as a 'boundary'
subset_0 = subset.loc[subset['year']<thres]
subset_1 = subset.loc[subset['year']>thres]
#TODO: This is repeating the above. Refactor sometime?
#Fit the clustering algorithm for the two subsets in the data and generate the reports
db_report_0,db_report_1 = [cluster_report(subset,dbscan(subset,coords=['grid_lon','grid_lat'],
epsilon=epsilon,
min_samples=min_samples)) for subset in [subset_0,subset_1]]
#Get cluster frequencies
clust_locs_0,clust_locs_1 = [
flatten_freqs([df[4].reset_index(level=0).index]) for df in [db_report_0,db_report_1]]
clust_locs_0.name = 'cluster_t0'
#Create the cluster frequency count
target['y'] = pd.concat([target,clust_locs_1],axis=1,sort=True,join='outer')
#Fill missing values
target.fillna(0,inplace=True)
#Store all the information
#Concatenate the target, features etc
bm_data = pd.concat([target,self.ind_vars],axis=1,sort=True,join='inner').fillna(0)
#Model
bm_model= sm.Poisson(bm_data['y'],
bm_data[x_vars]).fit(cov_type='cluster',cov_kwds={'groups':bm_data['country']},
maxiter=5000, maxfun=5000)
#Store everything using the right key
self.data_store[category] = bm_data
self.model_store[category] = bm_model
return(self)
def compare_results(self,ax):
'''
This method creates a barchart with confidence intervals to compare the results.
'''
#Load the results
#For each key in the stored models we will extract model outputs (parameters and confidence intervals)
parameters = []
conf_int = []
#Do this
for x in self.model_store.keys():
parameters.append(self.model_store[x].params)
#Extract the confident intervals
conf_ints = self.model_store[x].conf_int()
conf_ints_low = conf_ints[0]
conf_ints_high = conf_ints[1]
conf_int.append([conf_ints_low,conf_ints_high])
#Create dataframes and give them columns
model_results_df = pd.concat(parameters,axis=1)
conf_int_low_df = pd.concat([x[0] for x in conf_int],axis=1)
conf_int_high_df = pd.concat([x[1] for x in conf_int],axis=1)
#Variable names
model_results_df.columns= self.model_store.keys()
conf_int_low_df.columns= self.model_store.keys()
conf_int_high_df.columns= self.model_store.keys()
#And to plot
my_vars = [
#'arxiv_totals',
'arxiv_entropy',
#'cb_totals',
'cb_entropy',
#'research_industry_interaction',
#'entropy_interaction',
'is_china',
#'is_europe'
]
#DFs to plot
plot_df = model_results_df.loc[my_vars]
low = conf_int_low_df.loc[my_vars]
high = conf_int_high_df.loc[my_vars]
# Do the barplot
#Number of bars in the x axis
x_n = np.arange(len(plot_df))
for num,col in enumerate(plot_df.columns):
ax.bar(x=x_n+0.15*num,
height=plot_df[col],
width=0.15,
#yerr expects positive offsets from the bar height: [coefficient - lower CI, upper CI - coefficient]
yerr=np.array([plot_df[col]-low[col],high[col]-plot_df[col]]),
align='center',
ecolor='black', capsize=3,alpha=0.5
)
#print(plot_df[col])
#print(low[col])
#print(high[col])
ax.set_xticks(x_n+0.25)
ax.set_xticklabels(my_vars,rotation=45,ha='right',size=12)
ax.legend(list(plot_df.columns),loc='upper right',title='Category',fontsize=12,ncol=len(plot_df.columns))
ax.set_title('Coefficients for key variables in Poisson Regression',size=16)
self.params = plot_df
self.conf_int = [low,high]
In [239]:
#These are the 'shared' independent variables
data = pd.concat([pred_norm,control.drop('cluster_t0',axis=1)],axis=1,sort=False,join='inner').fillna(0)
data['intercept']=1
data['entropy_interaction'] = data['arxiv_entropy']*data['cb_entropy']
data['research_industry_interaction'] = data['arxiv_totals']*data['cb_totals']
In [240]:
['cs.NI', 'cs.DM', 'cs.CC', 'cs.DS', 'cs.LO', 'cs.IT']
Out[240]:
In [241]:
#xs are the indep variables
xs = data
#Initialise
test = DlSpatialMulti(mv_data,xs)
test.get_dl_results(min_samples=100,epsilon=10)
for cat in ['cs.NI','cs.IT','cs.DS','cs.CR']:
print(cat)
test.get_benchmark_results(cat,min_samples=50,epsilon=10)
In [242]:
fig,ax = plt.subplots(figsize=(10,6))
test.compare_results(ax=ax)
ax.set_title('')
ax.set_ylabel('Estimated coefficient',fontsize=12)
#ax.set_ylab
plt.tight_layout()
plt.savefig(fig_path+'/paper_figures/figure_7_poisson_comparison.pdf',bbox_inches='tight')
In [243]:
#xs are the indep variables
xs = data
#Initialise
test_2 = DlSpatialMulti(mv_data,xs)
test_2.get_dl_results(min_samples=50,epsilon=10)
for cat in ['cs.NE','cs.AI','cs.CV','cs.CL','stat.ML']:
print(cat)
test_2.get_benchmark_results(cat,min_samples=50,epsilon=10)
fig,ax = plt.subplots(figsize=(15,6))
test_2.compare_results(ax=ax)
In [244]:
comb = pd.concat([cb_totals,arxiv_totals],axis=1).fillna(0)
comb.sort_values('arxiv_totals',ascending=False,inplace=True)
comb = comb.apply(lambda x: x/x.sum(),axis=0).iloc[:100,:]
In [245]:
comb.corr()
Out[245]:
In [262]:
n_iters = 5
import scipy.stats as ss
def calc_ECI_plus(X, n_iters):
'''
ECI+ style complexity measure. X is a region x category matrix (here, LQs); the iteration
reweights each region's activity by the ubiquity of the categories it is present in,
normalising by the geometric mean at every step.
'''
ECI_mat = np.zeros((X.shape[0], n_iters))
#Initialise with each region's total activity, normalised by the geometric mean
x = X.values.sum(axis=1)
x = x/ss.gmean(x)
ECI_mat[:, 0] = x
for n in range(1, n_iters):
#Downweight activity in ubiquitous categories (weighted by the previous iteration's scores)
x = (X.values/(X.values/ECI_mat[:, n-1][:, np.newaxis]).sum(0)).sum(1)
x = x/ss.gmean(x)
ECI_mat[:, n] = x
#Final score: log of the last iteration minus the logged share-based diversity of the region
ECI = np.log(ECI_mat[:, -1]) - np.log((X/X.sum(0)).sum(1))
return pd.DataFrame(ECI_mat, index=ECI.index), ECI
arxiv_eci = calc_ECI_plus(create_lq_df(arxiv_cat_totals),n_iters=5)[1]
cb_eci = calc_ECI_plus(create_lq_df(cb_cat_totals),n_iters=5)[1]
ecis= pd.concat([arxiv_eci,cb_eci],axis=1)
ecis.columns=['arxiv_eci','cb_eci']
ecis['eci_interaction'] = ecis['arxiv_eci']*ecis['cb_eci']
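As a quick sanity check of calc_ECI_plus (a toy sketch with made-up data, not part of the analysis), the iteration can be run on a small random region x category matrix:
np.random.seed(0)
toy = pd.DataFrame(np.random.poisson(5,size=(6,4))+1, index=['region_'+str(i) for i in range(6)], columns=['a','b','c','d'])
calc_ECI_plus(create_lq_df(toy),n_iters=5)[1].sort_values(ascending=False)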
In [274]:
dl_0 = create_lq_df(pd.crosstab(papers_clust.loc[papers_clust.year<2012,'country_reg'],
papers_clust.loc[papers_clust.year<2012,'is_dl'])).apply(lambda x: np.log(x+0.001))
dl_1 = create_lq_df(pd.crosstab(papers_clust.loc[papers_clust.year>2012,'country_reg'],
papers_clust.loc[papers_clust.year>2012,'is_dl'])).apply(lambda x: np.log(x+0.001))
dl_periods = pd.concat([dl_0['dl'],dl_1['dl']],axis=1,join='inner')
dl_periods.columns=['dl_0','dl_1']
data_2 = pd.concat([dl_periods,data,ecis],axis=1,join='inner')
#data_2 = data_2.loc[data_2.arxiv_totals>np.percentile(data_2.arxiv_totals,75),:]
data_2.corr()
Out[274]:
In [278]:
data_2.fillna(0,inplace=True)
data_2.shape
# Remember to:
#Cluster standard errors on country
#Do robust standard errors
import statsmodels.api as sm
x_vars = ['intercept',
#'arxiv_totals',
#'arxiv_entropy',
#'cb_totals',
#'cb_entropy',
#'research_industry_interaction',
'arxiv_eci',
'cb_eci',
#'entropy_interaction',
#'eci_interaction',
'is_china',
'is_europe',
'dl_0',
'admin_area']
#data_2 = data_2.loc[data_2.arxiv_totals>np.median(data_2.arxiv_totals),:]
#Test model with robust standard errors
test_model= sm.OLS(data_2['dl_1'],data_2[x_vars]).fit(
cov_type='cluster',cov_kwds={'groups':data_2['country']})
test_model.summary()
Out[278]: