Unit 2: Programming Design

Lesson 14: Packages and Data Analysis

Scientific Context: Reproducible Research

There has been a lot of disussion lately in the scientific community and the media about a "reproducibility crisis" in science. Two relatively high profile examples over the years in medicine and psychology have really led the way in making the case for "Reproducible Research" but there is a strong argument for being able to reproduce lots of different kinds of computational science (including computational biology - this one even talks about Jupyter Notebooks). Hopefully your science professors and mentors have impressed upon you the long tradition of scientific note taking, this idea extends to more complex analysis of complex data.

Required Reading:
Ten Simple Rules for Reproducible Computational Research

Science Fundamentals: (Reproducible) Data Analysis

Usually at this point we would have to try to convince you that Jupyter Notebooks and Python are a good tools for doing reproducible computational research, but you're already convinced of that (or stuck with them). So we'll move along to the data.

All fields of sceince have some of their data in what we will call a data frame but is more commonly known as a table or spreadsheet. Here is an example:

sample      generation   genome_size
REL606          0        4.62
REL1166A     2000        4.63
ZDB409       5000        4.60
ZDB429      10000        4.59
ZDB446      15000        4.66
...

Pandas

The Python package called pandas "is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language" and has become the defacto standard for analyzing data in a DataFrame format using Python (pandas reminds me a lot of the dplyr package in R).

Our Lesson 14 class activity will be to use the pandas library of tools to analyze a DataFrame.

Pre-Activity Questions

1. Define "Reproducible Research" in your own words. Make sure to cite sources you used to come to this definition.

2. Provide a description/explanation of a field-specific example of both:
2a. Reproducible Research, and
2b. Not Reproducible Research.

Make sure to cite sources you used to answer this question.

3. Find a "data frame" that has data from your field in it that has at least 3 columns (variables) and 20 rows (observations).
3a. Save this file as a .csv (comma separated value) file on your computer. (no need to answer here)

3b. Describe the data, purpose, context, etc of the data. Make sure to cite the sources of information that you use and provide a link to the data itself.

3c. Paste the first few rows below (like the example above):

your data here
...