author: lukethompson@gmail.com
date: 5 October 2017
language: Python 3.5
license: BSD3

ORDER OF SCRIPTS:

  1. metadata_refine_step1_studies.ipynb
  2. metadata_refine_step2_samples.ipynb
  3. metadata_refine_step3_qiita.ipynb

Generate table of studies for EMP meta-analysis

This notebook takes two inputs:

  1. Excel file of the EMP Consortium list of studies (emp_consortium_gsheet.xlsx, downloaded from a shared Google Sheet and renamed).
  2. Excel file of study quality information (emp_studies_prepandas.xlsx), which includes the list of studies in the EMP paper.

It then merges the two tables on study_id, keeping only the studies (rows) and fields (columns) that pertain to the paper.
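
As a minimal sketch of such a merge, assuming toy stand-ins for the two spreadsheets (the values are made up and the title column is hypothetical), an inner join on study_id keeps only the studies present in both tables and carries over the columns of each:

```python
import pandas as pd

# Toy stand-ins for the two Excel inputs; values are made up for illustration.
consortium = pd.DataFrame({
    'study_id': [101, 102, 103],
    'title': ['soil', 'ocean', 'gut'],  # hypothetical column
})
quality = pd.DataFrame({
    'study_id': [102, 103, 104],
    'emp_paper': [True, True, False],
})

# Inner join (the pandas default): studies found in only one table are dropped.
toy_merged = pd.merge(consortium, quality, on='study_id')
print(toy_merged)
```

Studies 101 and 104 each appear in only one table, so neither survives the join.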

The output file, emp_studies.csv, is the input for metadata_refine_step2_samples.ipynb, which generates emp_studies_no_controls_YYYYMMDD.tsv for Extended Data Table 1 in the paper.


In [1]:
import pandas as pd

In [2]:
path1 = '../../data/metadata-refine/emp_consortium_gsheet.xlsx'
path2 = '../../data/metadata-refine/emp_studies_prepandas.xlsx'
path_output = '../../data/metadata-refine/emp_studies.csv'

In [3]:
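# Consortium sheet: read IDs and sample counts as integers, PubMed IDs as text.
df1 = pd.read_excel(path1, converters={
    'study_id': int,
    'num_samples': int,
    'pmid': str,
})
# Study quality sheet: read length as an integer, status flags as booleans.
df2 = pd.read_excel(path2, converters={
    'read_length_bp': int,
    'study_ok': bool,
    'release1_study': bool,
    'release2_study': bool,
    'emp_paper': bool,
    'metadata_minimal': bool,
})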
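
One caveat about the bool converters, offered as a sketch and not as a description of the actual spreadsheets: Python's bool() treats every non-empty string as True, so these converters are only safe if the flag columns hold real Excel booleans or 0/1 numbers. If the flags were stored as text such as 'FALSE', a small parser like the hypothetical to_bool below could be passed as the converter instead.

```python
def to_bool(value):
    """Hypothetical converter: map common spreadsheet spellings to a bool."""
    if isinstance(value, str):
        return value.strip().lower() in ('true', 't', 'yes', 'y', '1')
    return bool(value)

# bool() alone misreads text flags, because any non-empty string is truthy.
print(bool('FALSE'))     # True
print(to_bool('FALSE'))  # False
print(to_bool(1), to_bool(0), to_bool(True))  # True False True
```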

In [4]:
# Inner join on study_id: only studies present in both tables are kept.
df_merged = pd.merge(df1, df2, on='study_id')
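
Because the default inner join silently drops studies that appear in only one file, it may be worth checking which ones go missing. The following is a sketch on made-up frames (not the real data) that uses the indicator and validate options of pd.merge; the names left, right, check, and unmatched are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({'study_id': [1, 2, 3], 'num_samples': [10, 20, 30]})
right = pd.DataFrame({'study_id': [2, 3, 4], 'study_ok': [True, False, True]})

# Outer join with a _merge column recording where each row came from;
# validate raises MergeError if study_id is duplicated on either side.
check = pd.merge(left, right, on='study_id', how='outer',
                 indicator=True, validate='one_to_one')

# Study IDs that would be lost by an inner join.
unmatched = check.loc[check['_merge'] != 'both', 'study_id'].tolist()
print(unmatched)
```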

In [5]:
# Sort by study ID, renumber the rows, and write the merged table to CSV.
df_merged.sort_values('study_id').reset_index(drop=True).to_csv(path_output)
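
Note that to_csv is called without index=False, so the file begins with an unnamed column holding the 0-based row index. The sketch below shows what a reader of the file would see, using an in-memory buffer and made-up rows in place of emp_studies.csv; how step 2 actually reads the file is not shown here.

```python
import io
import pandas as pd

# Made-up rows standing in for the merged table.
toy = pd.DataFrame({'study_id': [20, 10], 'emp_paper': [True, False]})

buffer = io.StringIO()
toy.sort_values('study_id').reset_index(drop=True).to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])  # header line starts with an empty field

# index_col=0 turns that first column back into the index.
roundtrip = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(roundtrip.columns))
```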
