The Stark Realities of Reproducible Sociological Research


Professor Vernon Gayle, University of Edinburgh, UK.


vernon.gayle@ed.ac.uk or @profbigvern.

Overview

Reproducible research (large-scale survey data resource)

Duplication and Replication

Workflow in a Jupyter Notebook

Open via GitHub

Research Question

Can a sociological researcher follow Professor Philip Stark's checklist for reproducible research and undertake a plausible piece of analysis, using genuine large-scale data with realistic levels of messiness?

GET OUT YOUR LAPTOP

go to...

nbviewer.jupyter.org/

type....

github.com/vernongayle/new_rules_of_the_sociological_method/blob/master/noobs.ipynb

Research Ethics Approval Application

A research ethics approval application has been made to the School of Social and Political Science, University of Edinburgh, and has been posted here on GitHub.

Research Ethics Approval

From: MOORE Niamh
Sent: 26 June 2017 17:01
To: GAYLE Vernon Vernon.Gayle@ed.ac.uk
Cc: SSPS Research ssps.research@ed.ac.uk
Subject: FW: Ethics form submission (Vernon Gayle: The Stark Realities of Reproducible Sociological Research)

Hi Vernon,

Approved at level 1. If only they were all so straightforward. Good luck with the project.

All the best,
Niamh

Dr Niamh Moore
Chancellor's Fellow | Deputy Director of Research (Ethics)
Sociology | Room 3.09, 3F2 | 18 Buccleuch Place
School of Social and Political Sciences | University of Edinburgh | Edinburgh EH8 8LN
niamh.moore@ed.ac.uk | @rawfeminism | +44 (0)131 650 8260 | skype: niamhresearcher
http://www.sociology.ed.ac.uk/people/staff/niamh_moore

Pre-Analysis Plan

A pre-analysis plan is openly available in Word format.

The pre-analysis plan has been formally timestamped by OriginStamp.

hash: ca0fc7d948fd67cf8a1a2ac9111e9bf40425c010dfdf76ef33a0e578a90981a8

Submitted to OriginStamp: 24 Jun 2017 21:00:24 GMT
Submitted to the Blockchain: 25 Jun 2017 16:00:21 GMT

This document can be verified using the hash at https://app.originstamp.org/verify .
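The hash can also be recomputed locally and compared with the value above. Below is a minimal sketch in R using the open-source digest package, assuming the pre-analysis plan is saved as pre_analysis_plan.docx (the file name is illustrative).

# Sketch only (file name assumed): recompute the SHA-256 hash of the
# pre-analysis plan so it can be compared with the OriginStamp record.
library(digest)

digest("pre_analysis_plan.docx", algo = "sha256", file = TRUE)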

Overview of the Reproducibility Checklist

http://www.bitss.org/2015/12/31/science-is-show-me-not-trust-me/

Philip Stark outlines 14 reproducibility points that an analysis can fail on

See the handout

  1. If you relied on Microsoft Excel for computations, fail.
  2. If you did not script your analysis, including data cleaning and munging, fail.
  3. If you did not document your code so that others can read and understand it, fail.
  4. If you did not record and report the versions of the software you used (including library dependencies), fail.
  5. If you did not write tests for your code, fail.
  6. If you did not check the code coverage of your tests, fail.
  7. If you used proprietary software that does not have an open-source equivalent without a really good reason, fail.
  8. If you did not report all the analyses you tried (transformations, tests, selections of variables, models, etc.) before arriving at the one you chose to emphasize, fail.
  9. If you did not make your code (including tests) available, fail.
  10. If you did not make your data available (and a law like FERPA or HIPAA doesn’t prevent it), fail.
  11. If you did not record and report the data format, fail.
  12. If there is no open source tool for reading data in that format, fail.
  13. If you did not provide an adequate data dictionary, fail.
  14. If you published in a journal with a paywall and no open-access policy, fail.

1. If you relied on Microsoft Excel for computations

Excel was not used in this work.

2. If you did not script your analysis, including data cleaning and munging

Failing to script analyses is common in stratification research and in quantitative sociology more generally. In this work all of the analysis was scripted (see Data Wrangling and Data Analysis).

3. If you did not document your code so that others can read and understand it

As far as practicable I have attempted to write this Jupyter notebook as a 'literate data analysis document'.

I have provided information on using this notebook, along with authorship and meta-information.

Literate Computing

Fernando Perez says

"Literate Computing is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components, for the goals of communicating results in scientific computing and data analysis" (see http://blog.fperez.org/).

Authorship and Meta Information

Author: Professor Vernon Gayle Orcid id:
http://orcid.org/0000-0002-1929-5983

Project: Reproducible Sociological Research

Sub-project: Stratification Conference (Edinburgh) September 2017

4. If you did not record and report the versions of the software you used (including library dependencies)

I reported on the computing environment and data analysis software including library dependencies.

Computing Environment

Work was undertaken on a Microsoft Surface Pro; public IP address 109.152.252.166.

Processor: Intel(R) Core™ i5-4300U CPU @ 1.90 GHz (2.50 GHz)
Installed memory (RAM): 4.00 GB
System type: 64-bit operating system, x64-based processor

Data Analysis Software

R Analysis

The data analysis in this paper is mainly undertaken in R.

The decision to use R is motivated by checklist item 7: it is an attempt to use an open-source data analysis package rather than proprietary software.


In [6]:
# If you have not run the notebook sequentially... 

# these libraries are required

library(foreign)
library(survey)
library(car)
library(dplyr)
library(weights)
library(dummies)


Warning message:
package 'weights' was built under R version 3.2.5
Loading required package: Hmisc
Warning message:
package 'Hmisc' was built under R version 3.2.5
Loading required package: ggplot2
Warning message:
package 'ggplot2' was built under R version 3.2.5
Error: package 'ggplot2' could not be loaded
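As a minimal illustration (this is not a cell from the original notebook), the R version, operating system and package versions can be captured directly in the output with sessionInfo():

# Sketch only: record the R version, operating system and the versions
# of attached packages alongside the analysis output.
print(sessionInfo())

# Versions of the specific libraries loaded above.
sapply(c("foreign", "survey", "car", "dplyr", "weights", "dummies"),
       function(pkg) as.character(packageVersion(pkg)))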

5. If you did not write tests for your code

I provided two code tests, one for logistic regression and one for quasi-variance estimation, which are checked against published results.


In [8]:
summary(myautologit1)


Out[8]:
Call:
glm(formula = foreign ~ weight + mpg, family = "binomial", data = myautodata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0436  -0.4285  -0.2207   0.5347   2.0679  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 13.708367   4.518709   3.034 0.002416 ** 
weight      -0.003907   0.001012  -3.862 0.000113 ***
mpg         -0.168587   0.091917  -1.834 0.066637 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 90.066  on 73  degrees of freedom
Residual deviance: 54.350  on 71  degrees of freedom
AIC: 60.35

Number of Fisher Scoring iterations: 6

These results are identical to those reported in the Stata Manual (p. 1271).
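A rough sketch of how such a test can be scripted end to end is shown below. It is illustrative only: the data path and the tolerance are my assumptions, and the published values are the coefficients reported above.

# Sketch only: fit the logit model on Stata's 'auto' dataset and stop with
# an error if the estimates do not match the published coefficients.
library(foreign)

myautodata <- read.dta("auto.dta")   # Stata 'auto' dataset (path assumed)

myautologit1 <- glm(foreign ~ weight + mpg,
                    family = "binomial", data = myautodata)

published <- c(13.708367, -0.003907, -0.168587)   # intercept, weight, mpg
estimated <- unname(coef(myautologit1))

stopifnot(all(abs(estimated - published) < 1e-4))   # agree to 4 decimal places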

6. If you did not check the code coverage of your tests

I did not write or use any new tests.

7. If you used proprietary software that does not have an open-source equivalent without a really good reason

The data enabling (i.e. wrangling and cleaning) and the analyses were undertaken in R, which is open-source software.

I also tried the analysis in Python and Stata, and had to use SPSS for the duplication!

8. If you did not report all the analyses you tried (transformations, tests, selections of variables, models, etc.) before arriving at the one you chose to emphasize

I reported on all the analyses including data transformations, tests, selections of variables, alternative models and failed activities.

9. If you did not make your code (including tests) available

Information on how the code is licensed is provided. The code is available on GitHub: https://github.com/vernongayle .

License

The MIT License (MIT)

Copyright (c) 2017 Vernon Gayle, University of Edinburgh

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

10. If you did not make your data available (and a law like FERPA or HIPAA doesn’t prevent it)

The data cannot be made publicly available, but researchers can access the data from the UK Data Service https://www.ukdataservice.ac.uk/ .

11. If you did not record and report the data format

A description of the research dataset, as well as information on the data format and the time and date of the download, is provided.

YCS Cohort Nine (1998-2000) UK Data Archive Study 4009 https://discover.ukdataservice.ac.uk/catalogue/?sn=4009

The population studied was male and female school pupils in England and Wales who had reached minimum school leaving age in the 1996/1997 school year. To be eligible for inclusion they had to be aged 16 on August 31st 1997.

Downloaded: UK Data Service https://www.ukdataservice.ac.uk/
Date: 19th June 2017
Time: 19:54

Finch, S.A., La Valle, I., McAleese, I., Russell, N., Nice, D. and Fitzgerald, R. (2004). Youth Cohort Study of England and Wales, 1998-2000. [data collection]. 5th Edition. UK Data Service. SN: 4009,
http://doi.org/10.5255/UKDA-SN-4009-1

12. If there is no open-source tool for reading data in that format

The code to read the data, wrangle the data and produce all of the results is written in R, which is open-source, and is provided in a Jupyter notebook, which is also open-source and is available on the open-source platform GitHub: https://github.com/vernongayle.
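As an illustration (the file names are assumed rather than the exact ones used in this workflow), both of the UK Data Service delivery formats can be read with the open-source foreign package:

# Sketch only (file names assumed): read the SPSS and Stata versions of a
# UK Data Service file using the open-source 'foreign' package.
library(foreign)

ycs_spss  <- read.spss("ycs9sw1.sav", to.data.frame = TRUE)   # SPSS format
ycs_stata <- read.dta("ycs9sw1.dta")                          # Stata format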

13. If you did not provide an adequate data dictionary

A data dictionary (or codebook) is provided.

Data Dictionary (or Codebook)

This is the codebook for the file ycs9sw1_v4.rda, which contains mydata6.df.


serial: id variable, unique to YCS cohort 9


weight: survey weight, sweep 1, YCS cohort 9


s15a_c: outcome variable, 5+ GCSEs at grades A*-C, constructed from the variable "s1a_c"

0 = 1-4 GCSEs at grades A*-C
1 = 5+ GCSEs at grades A*-C
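A quick machine-generated complement to this hand-written codebook can be produced in R. This is a sketch only, assuming the analysis file is in the working directory:

# Sketch only (path assumed): load the analysis file and summarise the
# variables to accompany the hand-written codebook.
load("ycs9sw1_v4.rda")   # provides the data frame mydata6.df

str(mydata6.df)       # variable names, types and example values
summary(mydata6.df)   # basic distributional summaries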


14. If you published in a journal with a paywall and no open-access policy

The work has not yet been published, but it will be available under the UK green open-access policy via my university repository: http://www.research.ed.ac.uk/portal/en/persons/vernon-gayle(682d7da1-a2ad-49f0-b36c-64478c658f99).html .

Discussion

The Pre-Analysis Plan Reviewed

Tasks

1). Duplication of Logistic Regression Model Reported in Connolly (2006)

Achieved.

2). Replication of Logistic Regression Model Reported in Connolly (2006) Using Quasi-Variance based Estimation

Achieved.

3). Replication of Logistic Regression Model Reported in Connolly (2006) Adding National Socio-economic Classification (NS-SEC) Measure Social Class from UK Data Archive Study 5765

Achieved.

Deliverables

1). A reproducible workflow within a Jupyter notebook deposited in a Git repository
Achieved.

2). A data dictionary (codebook) accompanying the work
Achieved.

The Reproducibility Checklist Revisited

In this section I reflect on how the work compares with Stark's Reproducibility Checklist.

http://www.bitss.org/2015/12/31/science-is-show-me-not-trust-me/

Philip Stark outlines 14 reproducibility points that an analysis can fail on

1. If you relied on Microsoft Excel for computations
Excel was not used in this work.

2. If you did not script your analysis, including data cleaning and munging
All of the analysis was scripted; see Data Wrangling and Data Analysis.

3. If you did not document your code so that others can read and understand it
As far as practicable I have attempted to write this Jupyter notebook as a 'literate data analysis document'.
I provided information on using this notebook, and on the authorship and meta-information.

4. If you did not record and report the versions of the software you used (including library dependencies)
I reported on the computing environment and data analysis software including library dependencies.

5. If you did not write tests for your code
I provided two code tests, one for logistic regression and one for quasi-variance estimation, which are checked against published results.

6. If you did not check the code coverage of your tests
I did not write or use any new tests.

7. If you used proprietary software that does not have an open-source equivalent without a really good reason
The data enabling (i.e. wrangling and cleaning) and the analyses were undertaken in R, which is open-source software.

8. If you did not report all the analyses you tried (transformations, tests, selections of variables, models, etc.) before arriving at the one you chose to emphasize
I reported on all the analyses including data transformations, tests, selections of variables, alternative models and failed activities.

9. If you did not make your code (including tests) available
Information on how the code is licensed is provided. The code will be made available on GitHub: https://github.com/vernongayle .

10. If you did not make your data available (and a law like FERPA or HIPAA doesn’t prevent it)
The data cannot be made publicly available, but researchers can access the data from the UK Data Service https://www.ukdataservice.ac.uk/ .

11. If you did not record and report the data format
A description of the research dataset, as well as information on the data format and the time and date of the download, is provided (similar information is provided for the Croxford et al. (2007) dataset, which is used to harvest an alternative social class measure).

12. If there is no open-source tool for reading data in that format
The code to read the data, wrangle the data and produce all of the results is written in R, which is open-source, and is provided in a Jupyter notebook, which is also open-source and will be made available on the open-source platform GitHub: https://github.com/vernongayle.

13. If you did not provide an adequate data dictionary
A data dictionary (or codebook) is provided.

14. If you published in a journal with a paywall and no open-access policy
The work has not yet been published, but it will be available under the UK green open-access policy via my university repository: http://www.research.ed.ac.uk/portal/en/persons/vernon-gayle(682d7da1-a2ad-49f0-b36c-64478c658f99).html .


Conclusions

In conclusion, Stark's Reproducibility Checklist provides an important set of benchmarks, and it can reasonably be regarded as a 'Berkelium Standard' (i.e. beyond gold).

The items on Stark's checklist represent solid targets to aim for.

Given the present research culture in sociology, the programming skills, and the data analytical capabilities of researchers, the items on Stark's Reproducibility Checklist probably represent too large a step forward at the current time.

Therefore in the next section I posit Some Newer Rules of the Sociological Method which might act as a more immediate and practicable set of guidelines for undertaking reproducible sociological research using large-scale and complex social surveys and administrative datasets.

Some Newer Rules of the Sociological Method

The ultimate goal: The provenance of every result should be clear and as open as possible.

The overall aim: There should be enough suitable information available to completely duplicate results, without having to contact the authors.

Here are 5 broad ‘Newer Rules of the Sociological Method’ that are tailored to the analysis of large-scale and complex social science datasets.

One - Use established data analysis software (e.g. Stata, SPSS, or R), and clearly state the version, libraries, dependencies and plugins.

Two - Clearly identify the version of the dataset and its origins (i.e. where and when it was obtained).

Three - Write down all of the code for how the data were prepared for analysis, in a format that can easily be read by someone unconnected with the project.

Four - Write down all of the code for all of the analyses undertaken, and not just the analyses that are presented, in a format that can easily be read by someone unconnected with the project.

Five - Archive the material in an accessible format at a reachable location.

Within the archive

a) Provide suitable auxiliary information describing the contents of the archive, so that in future a third party unconnected with the project can understand the materials.
b) Provide a detailed codebook.
c) Make available all of the research code and information generated within the workflow.

The archived materials should be openly available.

Try to use recognised file formats and think about how best to help a third party who is unconnected with the project understand the contents of the archive at some time in the future.

5 Simple Newer Rules of the Sociological Method

1. Tell us about your software

2. Tell us about your data

3. Show us how you got your data ready

4. Show us all the analysis you did

5. Save all of this work openly

The overall motivation of this work was to explore the practicability of using Stark’s ‘reproducibility checklist’ in a piece of sociological research using genuine large-scale social science data.

The work on this project provides a striking reminder of the large amount of data enabling (i.e. data wrangling) that is required to duplicate a relatively straightforward published result. Despite knowing the data resource relatively well, duplicating a logit model with only three explanatory variables took effort and some detective work. The conclusions that are drawn are the result of an early exploration. After further reflection and discussions they are likely to be refined. As they currently stand, my conclusions are unlikely to be the last word on the subject of undertaking reproducible social science using large-scale and complex datasets.

In this section I will reflect on the items on Stark’s checklist and comment on their relevance and feasibility for sociological research using large-scale social science datasets.

1. If you relied on Microsoft Excel for computations, fail.

There is little justification for using a spreadsheet to undertake analyses of large-scale social science datasets. It is almost impossible to provide and document a clear audit trail when using a spreadsheet. The now well-known case of the errors in the spreadsheet-based calculations made in Reinhart and Rogoff (2010a; 2010b) which were reported by Herndon, Ash and Pollin (2014) should serve as a stern warning against using spreadsheets in social science data analyses. In addition Stark points to the more general problems of bugs in spreadsheet software (see also http://eusprig.org/horror-stories.htm).

2. If you did not script your analysis, including data cleaning and munging, fail.

Scripting the workflow is integral to successful social science data analysis. Having a planned and organised workflow is indispensable to producing high-quality social science research. Long (2009) provides an authoritative account of good practices in the social science data analysis workflow. More recently these principles have been distilled in Gayle and Lambert (2017). In practice large-scale social science datasets are almost never delivered in an immediately ‘analysis-ready’ format. The data analyst will almost always have to undertake some activities to enable the data for analysis. I use the term ‘data enabling’ to describe the stage between downloading the social science dataset (for example from a national archive) and beginning to undertake statistical analyses. ‘Data enabling’ comprises tasks associated with preparing and enhancing data for statistical analysis, such as recoding measures, constructing new variables and linking datasets (Blum et al., 2009; Lambert and Gayle, 2008). ‘Data enabling’ is a substantial part of the research process but its importance is often overlooked. The time required to ‘enable data’ is frequently underestimated, even by more experienced social science data analysts. An audit trail, which acts as a set of breadcrumbs, is essential for navigating back through the data enabling aspects of the workflow, and therefore for determining the provenance of results. A scripted workflow is essential for accurate, efficient, transparent and reproducible social science research.
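To make this concrete, the short sketch below shows the flavour of a scripted 'data enabling' step. The raw variable and the toy values are hypothetical; only the derived outcome s15a_c corresponds to the codebook above.

# Hypothetical sketch of a scripted 'data enabling' step: derive the binary
# outcome s15a_c from a (hypothetical) raw count of GCSE passes at grades
# A*-C, so the derivation is documented and auditable.
mydata <- data.frame(serial  = 1:5,
                     gcse_ac = c(0, 3, 5, 8, 2))   # toy values

mydata$s15a_c <- ifelse(mydata$gcse_ac >= 5, 1L,
                 ifelse(mydata$gcse_ac >= 1, 0L, NA_integer_))

table(mydata$s15a_c, useNA = "ifany")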

3. If you did not document your code so that others can read and understand it, fail.

Documenting research code is central to delivering reproducible work. The concept of making the workflow 'literate' is new within sociological research. The idea of producing explanations of the thinking behind individual steps in the workflow is novel. Producing commentaries in human readable language (e.g. plain English) interwoven between research code and outputs is innovative. The material produced above shows promising signs that this approach will pay dividends in making research endeavours more transparent and therefore reproducible. I am mindful of the old saying ‘that a recipe is not a recipe until someone else has cooked it’. A thoroughgoing proof of the literacy and the transparency of research code is only achieved when a third party, who is unconnected with the work, has successfully followed and executed the code. As a result of this position I am increasingly advocating activities such as the 'pair production' of research code, and 'peer reviewing' of research code. These activities will represent a marked change in how sociological research using large-scale datasets is routinely undertaken. If these activities are taken-up, and taken seriously, they will have consequences for how research teams undertake work, and how new researchers are trained (and existing researchers are re-trained).

4. If you did not record and report the versions of the software you used (including library dependencies), fail.

This is easily achieved, and can prove to be critical later when a researcher is trying to ‘duplicate’ the work (i.e. produce identical results). The exact results reported in Table 5 of Connolly (2006, p. 20) could not immediately be duplicated even though identical variables were constructed. It took some detective work to ascertain that the work was undertaken using SPSS in a specific mode. Since many analyses use special libraries and routines, it is important that they are precisely documented so that results can be duplicated and ultimately be checked and validated.

5. If you did not write tests for your code, fail.

This is a sensible requirement; however, because many sociological analyses employ standard and routine methods, it may be too stringent a requirement for every single sociological analysis. In this present analysis I compared the results of two methods, which were then used in the analysis, against existing published results. Stark suggests that you should test your software every time you change it. This is a sensible and reasonable precaution, and when network versions of software are changed or updated, universities and research institutions should re-test their software.

6. If you did not check the code coverage of your tests, fail.

Stark suggests that this would be a good practice but states that he has never seen a scientific publication that does so. As far as I understand it, in computer science, code coverage is a measure used to describe the degree to which the source code of a program is executed when a particular test suite (a set of cases intended to be used to test a software program to show that it has some specified set of behaviours) runs. In theory a program with high code coverage has had more of its source code executed during the testing which might suggest it has a lower chance of containing undetected errors. On reflection few sociological researchers develop new statistical tests or need to implement statistical tests within new software routines. Therefore this requirement is probably irrelevant to most mainstream sociological analyses using large-scale datasets. For researchers who are developing new tests or constructing new routines then testing the coverage of their code and clearly documenting it would be a sensible action.
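For the minority of projects where new routines are written, coverage can be measured in R with the open-source covr package. The file names below are hypothetical and this is a sketch rather than part of the original workflow.

# Hypothetical sketch: measure how much of a new routine's source code is
# exercised by its tests, using the 'covr' package.
library(covr)

cov <- file_coverage(source_files = "quasi_variance.R",
                     test_files   = "test_quasi_variance.R")

print(cov)              # line-by-line coverage report
percent_coverage(cov)   # overall percentage of lines executed by the tests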

7. If you used proprietary software that does not have an open-source equivalent without a really good reason, fail.

It is unrealistic to undertake anything more than extremely basic analyses of survey data without using data analysis software. The requirement to use non-proprietary software however is likely to prove controversial within the community of sociological researchers using large-scale datasets.
The free and open-source software R provides a viable approach, with a substantial volume of analytical options and considerable programming flexibility (Long, 2011). I have shown in this analysis that R can be used in a standard piece of sociological inquiry. The UK Data Service currently provides datasets in SPSS and Stata format. These formats can be read in R. The UK Data Service also provides data in a more package-agnostic tab-delimited format. Some R users advocate importing data in this format. In my experience this format can prove challenging, especially when matching and merging files and undertaking data enabling tasks.
I am a sociologist who has been undertaking research with large-scale and complex datasets for nearly a quarter of a century, and I have taught data analysis to undergraduate and post-graduate students, early career researchers and non-academic researchers. In my experience, for sociology students the R learning curve is steep. The skills which are necessary to effectively exploit R through textual programming seem unlikely to lead to its universal adoption amongst the wide-ranging user communities within the social sciences (see Lambert et al., 2015). A limitation is that R is currently not well suited to the analysis of large-scale social surveys. For example, when using R it is difficult to effectively combine the numeric codes for variables along with both their value and variable labels. This means that users are not able to effectively exploit the meta-information on measures that is helpful for routine survey data analysis tasks. A further limitation of R is that there is a lack of clear and concise help files containing applied examples that relate to the analysis of large-scale and complex social science datasets.
Within this research example I have undertaken a small amount of analysis using Python, which is an emerging open-source alternative to R. I was unable to undertake a survey-weighted analysis using a logistic regression model, but this may in part be due to my lack of competence with this software. A severe limitation of Python is that there is very little help and there are almost no applied examples that relate to the analysis of large-scale and complex social science datasets. At the current time there are fewer statistical routines and libraries available in Python, and Python does not offer an alternative to many packages that are available in R. Python is a widely used high-level language for general-purpose programming. Python is emerging as a valuable tool in data science (for example web scraping). In future it might develop into viable software for the analysis of large-scale social science datasets.
I have generally been an advocate of using Stata for the analysis of large-scale and complex social science datasets (see Gayle and Lambert 2017). Stata stands out as a sensible choice because it is a popular commercial package with a wide community of social science users, especially in economics. Stata is specifically orientated to the analysis of social survey datasets. Over the years I have observed that the Stata learning curve is less steep than the R learning curve. Stata has very good documentation. Within Stata there is a wide range of analytical capabilities, and ongoing developmental activities (see Lambert et al., 2015). Therefore I have concluded that overall it is the single most effective and efficient tool for undertaking and successfully completing survey data analysis. The tasks associated with data enabling, exploratory data analyses, building statistical models and organising presentation-ready and publication-ready outputs (by which I mean high-quality graphs and tables of modelling results) can all be undertaken using Stata in a single unified environment. The development of a Stata kernel in Jupyter, and the ability to use Stata via magic cells (as demonstrated above), illustrate how the software can effectively be used within a notebook. This is attractive for developing transparent research and bundling it within a unified research object.
SPSS is fairly ubiquitous within sociology departments. It is suited to the analysis of large-scale datasets but, compared with Stata, it is far more restricted in the range of statistical models that it can estimate. SPSS currently has fewer options for estimating models that are appropriate for longitudinal data. Stata is able to offer more comprehensive facilities to analyse survey datasets with complex designs and selection strategies. This is a clear benefit for social scientists working with contemporary datasets such as the UK Household Longitudinal Study (Understanding Society) and the UK Millennium Cohort Study.
In practice, given the current research climate within sociology and the programming knowledge and levels of data analysis skill, the requirement to abandon proprietary software is probably too large a step at the current time. The requirement could be relaxed to using an established mainstream data analysis package (e.g. Stata, SPSS, R or SAS), but the data enabling and the data analysis must be scripted in as ‘literate’ a fashion as possible. This is essential so that a third party who is unconnected with the project can follow and understand the workflow. Where possible it would be good practice to augment the work by reporting how an open-source analysis could be undertaken, in order to assist in duplicating (and therefore checking) results. In practice this might mean undertaking the data enabling and analysis in Stata but documenting how the work could also be reproduced in R or Python.

8. If you did not report all the analyses you tried (transformations, tests, selections of variables, models, etc.) before arriving at the one you chose to emphasize, fail.

Providing access to the complete workflow is an indispensable aspect of rendering sociological analysis transparent and reproducible. The use of Jupyter notebooks is a concrete example of organising or bundling the elements of the workflow into a ‘research object’ (see http://www.researchobject.org/). The use of Jupyter notebooks in sociological research extends the possibilities of material being Findable, Accessible, Interoperable and Reusable (FAIR) which is a tenet of reproducible science.

9. If you did not make your code (including tests) available, fail.

Stark states that your code should also state how it is licensed. This is a new departure in sociological research. There are a number of licenses that would be appropriate for this activity and that chime with wider academic ideas of attribution. In this present work I have chosen to use the MIT License. Stark further asserts that code should be published in a way that makes it easy for others to check, re-use and extend, for example by publishing it using services like Git repositories. At the current time very few sociological analyses of large-scale and complex datasets have reported all the code used to enable data and then to undertake the analysis.
Few sociological studies have used repositories. Git repositories are primarily used for source code management in software development, but can be used to keep track of changes in any set of files. These services are sometimes referred to as version control software (VCS). Gentzkow and Shapiro (2014) is a rare example of VCS being recommended in the social sciences. Mercurial is an alternative to Git and, whilst GitHub has been used in this example, other services such as Bitbucket provide similar functionality.

10. If you did not make your data available (and a law like FERPA or HIPAA doesn’t prevent it), fail.

Access to data is an integral part of transparent and reproducible social science research. The accessibility of data presents an obstacle for sociologists working with large-scale datasets. Much of the sociological analysis undertaken using large-scale and complex datasets is secondary analysis of general (or omnibus) data resources. These data resources are often national-level surveys (for example the US Panel Study of Income Dynamics or the British Household Panel Survey) or data collected as part of national censuses. These data do not ‘belong’ to the data analyst and are usually provided by a national archive or other data provider under some form of ‘end user license’. In practice these data are made available for research but cannot be freely shared, and all users must formally register for the data. The rules and regulations of data use vary across countries, between data providers, and between datasets. Administrative data resources (e.g. education records) usually have tighter controls placed on their use. Sensitive or confidential data (especially those relating to health) are usually subject to particularly strict controls. Unless the data have been collected by the sociologist, and are owned and controlled by them, it is unlikely that they will be able to freely share the data that have been analysed in a particular piece of work. Therefore, in order to facilitate transparent and reproducible work, sociologists should provide as much information as possible on the dataset (including detailed information on versions and downloads) to allow a third party to get access to the data that were genuinely used in the analysis.

11. If you did not record and report the data format, fail.

In order to facilitate transparent and reproducible work, sociologists should provide as much information as possible on the dataset (including detailed information on versions and downloads) to allow a third party to get access to the data that were genuinely used in the analysis. This is especially important when the data are not freely available and have to be accessed via a national repository or through a data provider (see point 10 above).

12. If there is no open source tool for reading data in that format, fail.

This point is critical when datasets are being made available alongside other research objects. In short, if data are unreadable then they do not add to transparency or reproducibility. In the case of secondary analysis of existing large-scale datasets that have been provided by national data archives, it is important that the code to read the data, to enable the data, and to produce all of the results is written in an accessible way. In this current project I have used R, which is open-source, and the research code is provided in a Jupyter notebook, which is also open-source and is made available on the open-source platform GitHub: https://github.com/vernongayle.

13. If you did not provide an adequate data dictionary, fail.

Providing an adequate data dictionary is a relatively easy task but it is not currently a ubiquitous practice. The acid test of a data dictionary is how easily it can be read by a third party who is unconnected with the project, and how useful it is to them when working with the data.

14. If you published in a journal with a paywall and no open-access policy, fail.

In the pursuit of transparent and reproducible sociological research, having open access to published work is critical. Stark suggests that posting the final version of your paper on a preprint server might be enough, but he thinks that it is time to move to open scientific publications. He further states that most publishers he has worked with have let him mark up the copyright agreements to keep copyright and grant them a non-exclusive right to publish. In the context of UK higher education research, the move to Green open access will improve the accessibility of published work. Green open access involves publishing in a traditional subscription journal as usual, but also ‘self-archiving’ in a repository (e.g. a university archive or external subject-based repository) and providing free access (although this might be after an embargo period set by the publisher). The UK Research Councils, which fund research, have a preference for immediate, unrestricted, online access to peer-reviewed and published research papers, free of any access charge and with maximum opportunities for re-use. This is commonly referred to as Gold open access (see http://www.rcuk.ac.uk/documents/documents/rcukopenaccesspolicy-pdf/).


References

Blum, J.M., Warner, G., Jones, S., Lambert, P., Dawson, A., Tan, K.L.L. and Turner, K.J., 2009. Metadata creation, transformation and discovery for social science data management: The DAMES Project infrastructure. IASSIST Quarterly, 33(1), pp.23-30.

Gayle, V. and Lambert, P., 2017. The Workflow: A Practical Guide to Producing Accurate, Efficient, Transparent and Reproducible Social Survey Data Analysis, Working Paper 1/17 UK National Centre for Research Methods, http://eprints.ncrm.ac.uk/4000/.

Gentzkow, M. and Shapiro, J., 2014. Code and data for the social sciences: A practitioner’s guide, University of Chicago mimeo. Available at: https://web.stanford.edu/gentzkow/research/CodeAndData.pdf (accessed 13th December 2016).

Herndon, T., Ash, M. and Pollin, R., 2014. Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of Economics, 38(2), pp.257-279.

Lambert, P., Browne, W. and Michaelides, D., 2015. Contemporary developments in statistical software for social scientists, in Procter, R. and Halfpenny, P. (eds) Innovations in Digital Research Methods. London: Sage.

Lambert, P. and Gayle, V., 2008. Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008, DAMES Project Technical Paper 3.

Long, J.D., 2011. Longitudinal data analysis for the behavioral sciences using R. New York: Sage.

Long, J.S., 2009. The workflow of data analysis using Stata. College Station, TX: Stata Press.

Reinhart, C. and Rogoff, K., 2010a. Growth in a Time of Debt, Working Paper no. 15639, National Bureau of Economic Research, http://www.nber.org/papers/w15639.

Reinhart, C. and Rogoff, K., 2010b. Growth in a Time of Debt, American Economic Review, vol. 100, no. 2, pp. 573-578.


A Little Light Relief

My Jupyter Limerick

A researcher with time to fritter

Decided he didn’t need Jupyter

His results he would show

Without a traceable workflow

Could a researcher be any stupider?


Converting this Jupyter Notebook into Portable Formats

see http://nbconvert.readthedocs.io/en/latest/

  1. At the cmd prompt run conda install nbconvert
  2. Change directory (for example, my directory is C:\Users\Vernon)
  3. Type jupyter nbconvert --to html mynotebook.ipynb

Acknowledgements

I would like to thank Philip Stark for his insightful presentation, which greatly helped to crystallise my thoughts on reproducibility and the data analysis workflow, and in which Philip very effectively set out his reproducibility rules.

I would also like to thank Philip for introducing me to Fernando Perez, who kindly invited me to BIDS, which proved to be very insightful. Min RK deserves a special mention for his initial help installing the Stata kernel.

Closer to home I would like to thank Roxanne Connelly, Chris Playford, Yuji Shimohira Calvo, Kevin Ralston and Alasdair Gray.


The work is very exploratory.

Positive comments are always appreciated, but brickbats improve work.

vernon.gayle@ed.ac.uk or @profbigvern


Copyright (c) 2017 Vernon Gayle, University of Edinburgh

END OF NOTEBOOK