In [1]:
import pandas
import numpy as np
First, we read the input and take a look at the column names.
In [2]:
fox = pandas.read_csv("../data/Fox2015_data.csv")
In [3]:
fox.columns
Out[3]:
Extract the unique manuscripts and count them
In [4]:
unique_ms = list(set(fox["MsID"]))
num_ms = len(unique_ms)
print(num_ms)
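As a side note, pandas can count distinct values directly. The sketch below uses a small made-up data frame (only the column name `MsID` comes from the notebook; the values are invented for illustration) to show that `Series.nunique()` gives the same count as `len(set(...))` when there are no missing values.

```python
import pandas

# Toy stand-in for the real data: the IDs below are invented for illustration
toy = pandas.DataFrame({"MsID": [101, 101, 102, 103, 103, 103]})

# Approach used in the notebook: build a set, then measure its length
num_ms_set = len(set(toy["MsID"]))

# pandas shortcut: count distinct values in the column
num_ms_pandas = toy["MsID"].nunique()

print(num_ms_set, num_ms_pandas)
```

By default `nunique()` ignores missing values, whereas a set would keep them, so the two counts can differ if `MsID` contains NaN.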
We restructure the data into individual lists holding the number of reviewers, the final decision, and the year for each manuscript. At the end, we convert the lists into NumPy arrays, which are much easier to subset with Boolean masks.
In [5]:
num_reviewers = []
final_decision = []
year = []
for ms in unique_ms:
    # extract the rows
    subset = fox[fox["MsID"] == ms]
    # count number of reviewers by summing ReviewerAgreed
    num_reviewers.append(sum(subset["ReviewerAgreed"]))
    # extract final decision
    if list(subset["FinalDecision"])[0] == 1:
        final_decision.append(1)
    else:
        final_decision.append(0)
    # extract year
    year.append(list(subset["Year"])[0])
# convert to np.array
num_reviewers = np.array(num_reviewers)
final_decision = np.array(final_decision)
year = np.array(year)
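The loop above filters the whole data frame once per manuscript. An alternative, sketched here on a small invented data frame, is to let `groupby` do the per-manuscript aggregation in one pass. The column names are the ones used in the notebook; the values and the `toy`/`per_ms` names are assumptions made for illustration only.

```python
import pandas

# Invented rows: one row per reviewer invitation, as assumed from the loop above
toy = pandas.DataFrame({
    "MsID": [1, 1, 1, 2, 2, 3],
    "ReviewerAgreed": [1, 0, 1, 1, 1, 0],
    "FinalDecision": [1, 1, 1, 0, 0, 1],
    "Year": [2009, 2009, 2009, 2010, 2010, 2010],
})

# One row per manuscript: sum the agreements, take the first decision and year
per_ms = toy.groupby("MsID").agg(
    num_reviewers=("ReviewerAgreed", "sum"),
    final_decision=("FinalDecision", "first"),
    year=("Year", "first"),
)

# Same recoding as the loop: 1 stays 1, any other value becomes 0
per_ms["final_decision"] = (per_ms["final_decision"] == 1).astype(int)

# Convert to NumPy arrays, mirroring the last step of the notebook cell
toy_num_reviewers = per_ms["num_reviewers"].to_numpy()
toy_final_decision = per_ms["final_decision"].to_numpy()
toy_year = per_ms["year"].to_numpy()

print(per_ms)
```

Two differences from the loop are worth keeping in mind: `groupby` orders the result by `MsID` rather than by set order, and the `"first"` aggregation skips missing values, while `list(...)[0]` would return them.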
Now we write a function that takes a year as input and prints the rejection rate for each number of reviewers, along with some other summary information. The rejection rate is computed as the mean of final_decision. If we call the function with 'all' instead of a year, the analysis is performed on the whole data set.
In [6]:
def get_prob_rejection(my_year = "all"):
    # subset the data
    if my_year != "all":
        my_num_reviewers = num_reviewers[year == my_year]
        my_final_decision = final_decision[year == my_year]
    else:
        my_num_reviewers = num_reviewers
        my_final_decision = final_decision
    # start printing output
    print("===============================")
    print("Year:", my_year)
    print("Submissions:", len(my_final_decision))
    print("Overall rejection rate:",
          round(my_final_decision.mean(), 3))
    print("NumRev", "\t", "NumMs", "\t", "rejection rate")
    for i in range(max(my_num_reviewers) + 1):
        print(i, "\t",
              len(my_final_decision[my_num_reviewers == i]), "\t",
              round(my_final_decision[my_num_reviewers == i].mean(), 3))
    print("===============================")
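The per-reviewer table printed by this function can also be built as a data frame, which is easier to reuse than printed text. The following is a minimal sketch on invented arrays (the `toy_` names and values are assumptions, not the notebook's data); it groups the decisions by number of reviewers and reports the group size and the mean decision.

```python
import numpy as np
import pandas

# Invented per-manuscript arrays standing in for num_reviewers and final_decision
toy_num_reviewers = np.array([0, 1, 1, 2, 2, 2])
toy_final_decision = np.array([1, 0, 1, 1, 1, 0])

# Group decisions by reviewer count; size gives NumMs, mean gives the rate
summary = (
    pandas.DataFrame({"NumRev": toy_num_reviewers,
                      "decision": toy_final_decision})
    .groupby("NumRev")["decision"]
    .agg(NumMs="size", rejection_rate="mean")
)

print(summary.round(3))
```

Unlike the `range(max(...) + 1)` loop, this only lists reviewer counts that actually occur, so it never takes the mean of an empty slice.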
In [7]:
get_prob_rejection("all")
It seems that a higher number of reviewers is indeed associated with a higher probability of rejection. Note in particular the difference between one and two reviewers.
We can simply call the function for each year. For example:
In [8]:
get_prob_rejection(2009)
In [9]:
for yr in range(2004, 2015):
    get_prob_rejection(yr)