author: lukethompson@gmail.com
date: 5 October 2017
language: Python 3.5
license: BSD3
ORDER OF SCRIPTS:
This notebook takes two inputs, emp_consortium_gsheet.xlsx and emp_studies_prepandas.xlsx, and merges them on study_id into a table with just the studies and fields (rows and columns) pertaining to the paper.
Output file emp_studies.csv is input for metadata_refine_step2_samples.ipynb, which generates emp_studies_no_controls_YYYYMMDD.tsv for Extended Data Table 1 in the paper.
In [1]:
import pandas as pd
In [2]:
path1 = '../../data/metadata-refine/emp_consortium_gsheet.xlsx'
path2 = '../../data/metadata-refine/emp_studies_prepandas.xlsx'
path_output = '../../data/metadata-refine/emp_studies.csv'
In [3]:
df1 = pd.read_excel(path1, converters={
    'study_id': int,
    'num_samples': int,
    'pmid': str,
})
df2 = pd.read_excel(path2, converters={
    'read_length_bp': int,
    'study_ok': bool,
    'release1_study': bool,
    'release2_study': bool,
    'emp_paper': bool,
    'metadata_minimal': bool,
})
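A note on the `bool` converters above: plain `bool` treats any non-empty string as true, so it is only safe if the spreadsheet stores these flags as real booleans or as 0/1 numbers. The source does not say how the cells are stored, so the following is a minimal sketch under the assumption that some flags might arrive as text such as 'FALSE'. The helper `to_bool` is hypothetical and is not part of the original notebook.

```python
def to_bool(value):
    """Hypothetical converter: map common spreadsheet encodings to bool."""
    if isinstance(value, str):
        # Text cells: accept a few common spellings of "true".
        return value.strip().lower() in {'true', 't', 'yes', 'y', '1'}
    # Non-text cells fall back to plain bool(). An empty cell read as NaN
    # is still truthy here, exactly as it would be with bool itself.
    return bool(value)

# Plain bool() is True for the string 'FALSE'; the helper is not.
print(bool('FALSE'), to_bool('FALSE'))
```

If it were needed, such a helper could be passed in place of `bool` in the `converters` dictionary, for example `'study_ok': to_bool`.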
In [4]:
df_merged = pd.merge(df1, df2, on='study_id')
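Because `pd.merge` defaults to an inner join, a study that appears in only one of the two spreadsheets is dropped from `df_merged` without warning. The sketch below uses invented toy tables (not the EMP data) to show one way this could be checked: `validate='one_to_one'` raises an error if a `study_id` is duplicated in either table, and an outer join with `indicator=True` lists the rows that failed to match.

```python
import pandas as pd

# Toy stand-ins for df1 and df2; all values are invented for illustration.
left = pd.DataFrame({'study_id': [1, 2, 3], 'num_samples': [10, 20, 30]})
right = pd.DataFrame({'study_id': [2, 3, 4], 'emp_paper': [True, True, False]})

# Inner join (the default): only study_id values present in both tables survive.
# validate='one_to_one' raises MergeError if study_id repeats in either table.
inner = pd.merge(left, right, on='study_id', validate='one_to_one')

# Outer join with indicator=True adds a '_merge' column recording
# whether each row came from the left table, the right table, or both.
outer = pd.merge(left, right, on='study_id', how='outer', indicator=True)
unmatched = outer.loc[outer['_merge'] != 'both', ['study_id', '_merge']]

print(inner)
print(unmatched)
```

Whether dropping unmatched studies is the intended behavior here is not stated in the notebook; the check only makes the effect visible.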
In [5]:
df_merged.sort_values('study_id').reset_index(drop=True).to_csv(path_output)
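With default arguments, `to_csv` also writes the row index as a first column with an empty header, so the output file carries a 0-based row counter in addition to the merged fields. A small sketch on invented data, writing to an in-memory buffer instead of `path_output`, shows what that looks like and how a downstream reader could absorb it with `index_col=0`:

```python
import io
import pandas as pd

# Invented toy table standing in for df_merged.
df = pd.DataFrame({'study_id': [3, 1, 2], 'title': ['c', 'a', 'b']})

# Same chain as the cell above, but written to a buffer for inspection.
buf = io.StringIO()
df.sort_values('study_id').reset_index(drop=True).to_csv(buf)

# The first header field is empty: that column is the row index.
header = buf.getvalue().splitlines()[0]
print(header)

# Reading back with index_col=0 restores the original columns.
buf.seek(0)
roundtrip = pd.read_csv(buf, index_col=0)
print(roundtrip)
```

Passing `index=False` to `to_csv` would omit that column, but this changes the file layout, so it is only appropriate if metadata_refine_step2_samples.ipynb does not rely on it.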