Sampling of papers from conferences

This Jupyter notebook includes the procedure used to select a random sample from the population of accepted papers for each conference.

Accepted conference papers

Sampling of papers is based on the listing of accepted papers at the following locations:

AAAI-14 http://www.aaai.org/Library/AAAI/aaai14contents.php

AAAI-16 http://www.aaai.org/Library/AAAI/aaai16contents.php

IJCAI-13 http://ijcai-13.org/program/accepted_papers

IJCAI-16 http://ijcai-16.org/index.php/welcome/view/accepted_papers

These listings were used to generate the files available in the data/ folder. Each conference is represented by a textfile containing the papers accepted to the conference's main and special tracks. Each line in the textfiles represent a paper, including its title and the authors. Example:

Causality based Propagation History Ranking in Social Networks  Zheng Wang, Chaokun Wang, Jisheng Pei, Xiaojun Ye and Philip S. Yu  
Intervention Strategies for Increasing Engagement in Volunteer-Based Crowdsourcing  Avi Segal, Kobi Gal, Ece Kamar, Eric Horvitz, Alex Bowyer and Grant Miller

Papers are available through AAAI Publications for all but IJCAI-16 (at the time of writing):

AAAI-14 http://www.aaai.org/ocs/index.php/AAAI/AAAI14/schedConf/presentations

AAAI-16 http://www.aaai.org/ocs/index.php/AAAI/AAAI16/schedConf/presentations

IJCAI-13 http://www.aaai.org/ocs/index.php/IJCAI/IJCAI13/schedConf/presentations

For IJCAI-16, see the proceedings at: http://www.ijcai.org/Proceedings/2016

First, the accepted papers are loaded from files.


In [1]:
from glob import glob

accepted_papers = {}
track_files = glob('data/accepted*'.format(dir))
for file in track_files:
    conference = file.split('_')[-1].strip('.txt')
    accepted_papers[conference] = []
    with open(file, 'r') as f:
        for line in f:
            accepted_papers[conference].append(line)

The resulting dictionary accepted_papers contains a list of the accepted papers for each conference.


In [2]:
for conference, papers in sorted(accepted_papers.items()):
    print('{conference} includes {papers} accepted papers.'.format(
        conference=conference, papers=len(papers)))


aaai-14 includes 398 accepted papers.
aaai-16 includes 548 accepted papers.
ijcai-13 includes 413 accepted papers.
ijcai-16 includes 551 accepted papers.

Selection

A sample population of 100 papers is selected from each conference using Python's pseudo-random number module. As per the documentation on random.sample "The resulting list is in selection order so that all sub-slices will also be valid random samples." The seed is set to the unix timestamp for Jan 10 14:46:40 2017 UTC: 1484059600.


In [3]:
import random
random.seed(1484059600)

k = 100
samples = {}

# The order is set explicitly due to originally not sorting
# accepted_papers.items().
conferences = ['aaai-16', 'aaai-14', 'ijcai-13', 'ijcai-16']

for conference in conferences:
    samples[conference] = random.sample(accepted_papers[conference], k)

Note that when originally generating the samples, the dictionary was iterated by the use of Python 3's dict.items() view. The order is not guaranteed. Due to the original generation not being sorted, the iteration needs to be set explicitly so future runs generate the same original sample populations.

The generated random samples are permanently stored to files in the ../data/ directory (Github: https://github.com/sidgek/msoppgave/tree/master/data/.


In [4]:
for conference, papers in samples.items():
    outputfile = 'data/sampled_{conference}'.format(conference=conference)
    with open(outputfile, 'w') as f:
        for line in papers:
            f.write(line)

Versions

Here's a generated output to keep track of software versions used to run this Jupyter notebook.


In [5]:
import IPython
import platform

print('Python version: {}'.format(platform.python_version()))
print('IPython version: {}'.format(IPython.__version__))


Python version: 3.5.3
IPython version: 6.1.0