There has been a lot of disussion lately in the scientific community and the media about a "reproducibility crisis" in science. Two relatively high profile examples over the years in medicine and psychology have really led the way in making the case for "Reproducible Research" but there is a strong argument for being able to reproduce lots of different kinds of computational science (including computational biology - this one even talks about Jupyter Notebooks). Hopefully your science professors and mentors have impressed upon you the long tradition of scientific note taking, this idea extends to more complex analysis of complex data.
Required Reading:
Ten Simple Rules for Reproducible Computational Research
Usually at this point we would have to try to convince you that Jupyter Notebooks and Python are a good tools for doing reproducible computational research, but you're already convinced of that (or stuck with them). So we'll move along to the data.
All fields of sceince have some of their data in what we will call a data frame but is more commonly known as a table or spreadsheet. Here is an example:
sample generation genome_size
REL606 0 4.62
REL1166A 2000 4.63
ZDB409 5000 4.60
ZDB429 10000 4.59
ZDB446 15000 4.66
...
The Python package called pandas "is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language" and has become the defacto standard for analyzing data in a DataFrame format using Python (pandas
reminds me a lot of the dplyr
package in R).
Our Lesson 14 class activity will be to use the pandas
library of tools to analyze a DataFrame.
1. Define "Reproducible Research" in your own words. Make sure to cite sources you used to come to this definition.
2. Provide a description/explanation of a field-specific example of both:
2a. Reproducible Research, and
2b. Not Reproducible Research.
Make sure to cite sources you used to answer this question.
3. Find a "data frame" that has data from your field in it that has at least 3 columns (variables) and 20 rows (observations).
3a. Save this file as a .csv
(comma separated value) file on your computer. (no need to answer here)
3b. Describe the data, purpose, context, etc of the data. Make sure to cite the sources of information that you use and provide a link to the data itself.
3c. Paste the first few rows below (like the example above):
your data here
...