This notebook performs a targeted metabolomics analysis on your raw data files. For this to work, your data files must be uploaded to the NERSC raw data directory with appropriate permissions. A file converter runs at NERSC to convert each file as follows: .raw > .mzml > .h5 > pactolus.gz. This conversion must happen before you can proceed with data analysis. A Northen Lab staff member can help you transfer and convert your files if they are not already at NERSC.
/global/project/projectdirs/metatlas/raw_data/SUBFOLDER
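Before proceeding, you can sanity-check that every stage of the conversion finished. This is a minimal sketch (not part of the standard workflow); substitute your own subfolder name, and note that extension casing (e.g. .mzML) may differ on your system:
In [ ]:
# Minimal sketch: count files at each stage of the .raw > .mzml > .h5 > pactolus.gz chain.
# 'SUBFOLDER' is a placeholder for your raw data subfolder.
import os, glob
raw_dir = '/global/project/projectdirs/metatlas/raw_data/SUBFOLDER'
for ext in ('raw', 'mzml', 'h5', 'pactolus.gz'):
    print(ext, len(glob.glob(os.path.join(raw_dir, '*.' + ext))))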
A targeted analysis requires a list of compounds to search for in your sample files. Thus, at a minimum, you must have some information on their m/z ratios in order to get started. The inputs for this notebook are the location of your raw files and an atlas listing these compounds.
Our lab has analyzed thousands of compounds using standardized LCMS methods. You may use one of these libraries to create an atlas for your sample analysis, provided you used the same standard LCMS methods.
Analysis steps
There is a database that stores your atlas and registered run files. During the analysis you will pull these into local variables, which are used to extract the EIC and MSMS data stored in a variable called metatlas_dataset. All output files are generated from that variable. Because it is local and not stored in the db, if your kernel dies partway through you will need to rerun the notebook to regenerate it.
IMPORTANT: Any time you adjust your RTs in the interactive plot, you need to retrieve your atlas again before exporting any files. The changes are stored in the db, but the exports are made from locally stored variables/dataframes.
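Since metatlas_dataset lives only in local memory, one optional safeguard (a sketch, not a required step; the filename is arbitrary) is to cache it to disk once it has been generated further down, the same way this notebook pickles hits:
In [ ]:
# Optional sketch: run after metatlas_dataset has been generated below.
# Caching it means a dead kernel does not force a full re-extraction;
# note the pickle can be large for big experiments.
# pickle.dump(metatlas_dataset, open(os.path.join(output_dir, 'metatlas_dataset.pkl'), 'wb'))
# To restore in a fresh kernel (after rerunning the import block):
# metatlas_dataset = pickle.load(open(os.path.join(output_dir, 'metatlas_dataset.pkl'), 'rb'))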
In [ ]:
# optional: run this block to adjust the width of the notebook. Change the width percent.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
In [2]:
%matplotlib notebook
import sys, os
# v edit this line
sys.path.insert(0,'/global/homes/FIRST-INITIAL-OF-USERNAME/USERNAME/REPOFOLDER/metatlas-master-20200416/metatlas-master/')
# ^ edit this line
from metatlas.helpers import fastanalysis as fa
from metatlas.helpers import dill2plots as dp
from metatlas.helpers import metatlas_get_data_helper_fun as ma_data
from metatlas.helpers import chromplotplus as cpp
import metatlas.metatlas_objects as metob
import qgrid
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import display
import time
import pickle
import dill
import multiprocessing as mp
import pandas as pd
import matplotlib.pyplot as plt
import operator
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 100)
In [3]:
# check that the import block above registered the metatlas directory correctly.
# If the printed path does not match the directory you inserted, seek help!
print(fa.__file__)
STEP 1: Create/set your project directory and output subfolder.
STEP 2: Set the path to the directory containing your atlas CSV.
In [5]:
pwd
Out[5]:
In [5]:
#STEP 1
project_directory='/global/homes/FIRST-INITIAL-OF-USERNAME/USERNAME/PROJECTDIRECTORY/' # <- edit this line; do not copy the path directly from NERSC (e.g. avoid paths through the u1 or u2 directories)
output_subfolder='HILIC_POS_20190830/' # <- edit this as 'chromatography_polarity_yyyymmdd/'
output_dir = os.path.join(project_directory,output_subfolder)
if not os.path.exists(project_directory):
os.makedirs(project_directory)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
#STEP 2
pathtoatlas='/global/homes/FIRST-INITIAL-OF-USERNAME/USERNAME/DIRECTORYOFATLAS/' # <- enter the directory where you have stored your atlas
#pathtoatlas= '%s%s' % (project_directory,output_subfolder)
The groups are created from the common file header plus a unique group name. The convention our lab group uses for filenames is as follows:
DATE_NORTHENLABINITIALS_COLLABINITIALS_PROJ_EXP_SAMPSET_SYSTEM_COLUMN-method_SERIAL_POL_ACQ_SAMPLENUMBER_SAMPLEGROUP_REP_OPTIONAL_SEQ
Ex.:20180105_SK_AD_ENIGMA_PseudoInt_R2ADec2017_QE119_50454_123456_POS_MSMS_001_Psyringae-R2A-30C-20hr_Rep01_NA_Seq001.raw
The common header consists of the fields 0-10: DATE_NORTHENLABINITIALS_COLLABINITIALS_PROJ_EXP_SAMPSET_SYSTEM_COLUMN-method_SERIAL_POL_ACQ
The sample group name is commonly field #12 (between underscores 12 and 13), 0-indexed.
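To make the indexing concrete, here is how the example filename above splits into the fields the grouping code below relies on (a small illustration; nothing here is stored):
In [ ]:
# Worked example of the filename convention (fields are 0-indexed after splitting on '_')
fname = '20180105_SK_AD_ENIGMA_PseudoInt_R2ADec2017_QE119_50454_123456_POS_MSMS_001_Psyringae-R2A-30C-20hr_Rep01_NA_Seq001.raw'
fields = fname.split('.')[0].split('_')
common_header = '_'.join(fields[:11])  # fields 0-10: DATE through ACQ
polarity = fields[9]                   # 'POS'
sample_group = fields[12]              # 'Psyringae-R2A-30C-20hr'
print(common_header)
print(polarity, sample_group)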
In [ ]:
files = dp.get_metatlas_files(experiment = '%ENTERSTRING%',name = '%ENTERSTRING%',most_recent = True)
# ^ edit the text string in experiment and name fields
df = metob.to_dataframe(files)
my_grid = qgrid.QGridWidget(df=df[['experiment','name','username','acquisition_time']])
my_grid.export()
In [ ]:
len(files)
This will attempt to create groups in an automated fashion (rather than filling out a spreadsheet with a list of files and group names). If your files are all in one folder at NERSC, you can use this option. If not, use Option B below.
A long group name (the common header + either a controlled-vocabulary value or field #12) and a short group name (just the controlled-vocabulary value or field #12) will be stored in a local variable. The short group names can be used on plots.
In [ ]:
#STEP 1: View the groups
files = metob.retrieve('lcmsruns',experiment='%ENTERSTRING%',username='*')
controlled_vocab = ['QC','InjBl','ISTD'] # add '_' to the beginning of a term if needed; any leading '_' is stripped from the short name
version_identifier = 'vrs3'
file_dict = {}
groups_dict = {}
for f in files:
k = f.name.split('.')[0]
# get index if any controlled vocab in filename
indices = [i for i, s in enumerate(controlled_vocab) if s.lower() in k.lower()]
prefix = '_'.join(k.split('_')[:11])
if len(indices)>0:
short_name = controlled_vocab[indices[0]].lstrip('_')
group_name = '%s_%s_%s'%(prefix,version_identifier,short_name)
short_name = k.split('_')[9]+'_'+short_name # Prepending POL to short_name
else:
short_name = k.split('_')[12]
group_name = '%s_%s_%s'%(prefix,version_identifier,short_name)
short_name = k.split('_')[9]+'_'+k.split('_')[12] # Prepending POL to short_name
file_dict[k] = {'file':f,'group':group_name,'short_name':short_name}
groups_dict[group_name] = {'items':[],'name':group_name,'short_name':short_name}
df = pd.DataFrame(file_dict).T
df.index.name = 'filename'
df.reset_index(inplace=True)
df.drop(columns=['file'],inplace=True)
for ug in groups_dict.keys():
for file_key,file_value in file_dict.items():
if file_value['group'] == ug:
groups_dict[ug]['items'].append(file_value['file'])
df.head(100)
In [ ]:
#STEP 2: create the groups variable, if the above looks OK
groups = []
for group_key,group_values in groups_dict.items():
g = metob.Group(name=group_key,items=group_values['items'],short_name=group_values['short_name'])
groups.append(g)
for item in g.items:
print(g.name,g.short_name,item.name)
print('')
In [ ]:
# STEP 3 Option A: store the groups variable content in the DB (currently only the long group name is stored)
metob.store(groups)
In [ ]:
## STEP 3 Option B-I: OPTIONAL: Export groups to csv file for editing (filename, short_name, group, description)
#dp.make_prefilled_fileinfo_sheet(groups,os.path.join(output_dir,'prefilled_fileinfo.tab'))
In [ ]:
## STEP 3 Option B-II: Import groups from csv file after editing the prefilled_fileinfo.tab
#groups = dp.make_groups_from_fileinfo_sheet(os.path.join(output_dir,'prefilled_fileinfo.tab'), filetype='tab', store=True)
Typically, you will make one fileinfo sheet with all of your files (pos and neg) for this experiment. At a minimum, group names MUST contain the first 11 underscore-delimited fields (DATE_NORTHENLABINITIALS_COLLABINITIALS_PROJ_EXP_SAMPSET_SYSTEM_COLUMN-method_SERIAL_POL_ACQ) and the 'SAMPLEGROUP' field.
Files can be from multiple folders at NERSC.
In [ ]:
#STEP 1: Select files
files = dp.get_metatlas_files(experiment = '%ENTERSTRING%',name = '%ENTERSTRING%',most_recent = True)
# ^ edit the text string in experiment and name fields
In [ ]:
#STEP 2: Save spreadsheet file
dp.make_empty_fileinfo_sheet('%s%s' % (output_dir,'empty_fileinfo.tab'),files)
In [ ]:
#STEP 3: create groups from file
g = dp.make_groups_from_fileinfo_sheet('%s%s' % (output_dir,'filled_fileinfo.txt'),
filetype='tab',
store=True)
In [ ]:
# STEP 4: check groups
metob.to_dataframe(g)
Here, you will assign your database groups to a local variable which will be used downstream in the notebook for analyzing your data with an atlas.
In [ ]:
groups = dp.select_groups_for_analysis(name = '%ENTERSEARCHSTRING%', # <- edit text search string here
most_recent = True,
remove_empty = True,
include_list = [], exclude_list = ['NEG','QC','InjBl']) # e.g. ['QC','Blank']
print("sorted groups")
groups = sorted(groups, key=operator.attrgetter('name'))
for i,a in enumerate(groups):
print(i, a.name)
In [ ]:
# to view metadata about your groups, run this block
metob.to_dataframe(groups)
Required Atlas file headers:
inchi_key,label,rt_min,rt_max,rt_peak,mz,mz_tolerance,adduct,polarity,identification_notes
Values must be filled in for every row and field except inchi_key (leaving it blank will prevent MSMS matching below) and identification_notes.
INFO: store=True will register your atlas in the database. If you are not sure your atlas structure is correct, set store=False the first time you run the block to check for errors. If there is no error, rerun it with store=True.
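If you are unsure about the CSV layout, this sketch writes a one-row atlas with the required headers. All compound values here are placeholders for illustration, not a real identification; replace them with your own compounds:
In [ ]:
# Sketch only: minimal atlas CSV with the required headers.
# rt values are in minutes, mz_tolerance is in ppm; all values are placeholders.
import pandas as pd
demo_atlas = pd.DataFrame([{
    'inchi_key': 'WQZGKKKJIJFFOK-GASJEMHNSA-N',  # D-glucose; leave blank to skip MSMS matching
    'label': 'glucose',
    'rt_min': 7.5, 'rt_max': 8.5, 'rt_peak': 8.0,
    'mz': 179.0561, 'mz_tolerance': 20,
    'adduct': '[M-H]-', 'polarity': 'negative',
    'identification_notes': ''}])
# keep the header order explicit
demo_atlas = demo_atlas[['inchi_key','label','rt_min','rt_max','rt_peak','mz','mz_tolerance','adduct','polarity','identification_notes']]
demo_atlas.to_csv(os.path.join(pathtoatlas, 'demo_atlas.csv'), index=False)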
In [ ]:
atlasfilename='EMA_SK_20190703_UVMoss_Chamber0206_SolUVAC_QE119_HIL_NEG_20190829_1735' # <- enter the exact name of your csv file without the file extension
names = dp.make_atlas_from_spreadsheet('%s%s%s' % (pathtoatlas,atlasfilename,'.csv'), # <- DO NOT EDIT THIS LINE
atlasfilename,
filetype='csv',
sheetname='',
polarity = 'negative',
store=True
#,mz_tolerance = 20 #Uncomment this line if you want to override the mz_tolerance values in your atlas, otherwise, the code will use your values in the csv sheet
)
In [ ]:
atlasfilename='EMA_SK_20190703_UVMoss_Chamber0206_SolUVAC_QE119_HIL_POS_20190829_1735' # <- enter the exact name of your csv file without the file extension
names = dp.make_atlas_from_spreadsheet('%s%s%s' % (pathtoatlas,atlasfilename,'.csv'), # <- DO NOT EDIT THIS LINE
atlasfilename,
filetype='csv',
sheetname='',
polarity = 'positive',
store=True
#,mz_tolerance = 20 #Uncomment this line if you want to override the mz_tolerance values in your atlas, otherwise, the code will use your values in the csv sheet
)
In [ ]:
atlases = metob.retrieve('Atlas',name='%UVMoss_Chamber0206_SolUVAC%POS%',username='YOUR-NERSC-USERNAME')
names = []
for i,a in enumerate(atlases):
print(i,a.name,pd.to_datetime(a.last_modified,unit='s'))#len(a.compound_identifications)
In [ ]:
my_atlas = atlases[0]
atlas_df = ma_data.make_atlas_df(my_atlas)
atlas_df['label'] = [cid.name for cid in my_atlas.compound_identifications]
print(my_atlas.name)
metob.to_dataframe([my_atlas])
# the first line of the output will show the dimensions of the atlas dataframe
In [ ]:
# OPTIONAL: to view your atlas, run this block
print(my_atlas.name)
atlas_df
The EIC data contains m/z, intensity, and RT values across your RT range. There are two parameters you will need to edit: extra_time and extra_mz. extra_time collects m/z, intensity, and RT values from outside your atlas-defined rt_min and rt_max. For example, if rt_min = 1.0 and rt_max = 2.0 and you set extra_time to 0.3, your new RT range will be 0.7 to 2.3. This is helpful for checking whether you have nearby peaks at the same m/z. extra_mz should only be used for troubleshooting; keep it at 0 unless you believe you had poor mass accuracy during your run. Another way to address that issue is to change the mz_tolerance values in your atlas. Before changing this value, check in with a metatlas-experienced lab member to discuss when/how to use it.
On your final run-through, set extra_time to 0.
In [ ]:
# extraction window parameters (see notes above); set extra_time to 0 on your final run-through
extra_time = 0.1
extra_mz = 0.00
all_files = []
for my_group in groups:
    for my_file in my_group.items:
        all_files.append((my_file,my_group,atlas_df,my_atlas,extra_time,extra_mz))
pool = mp.Pool(processes=min(4, len(all_files)))
t0 = time.time()
metatlas_dataset = pool.map(ma_data.get_data_for_atlas_df_and_file, all_files)
pool.close()
pool.terminate()
print(time.time() - t0)
In [ ]:
# Make data sources tables (atlas_metadata.tab, groups_metadata.tab, groups.tab and [atlasname]_originalatlas.tab within data_sources subfolder)
ma_data.make_data_sources_tables(groups, my_atlas, output_dir)
There are two options for generating the hits variable:
In [ ]:
##BLOCK A
t0 = time.time()
hits=dp.get_msms_hits(metatlas_dataset,extra_time=True,keep_nonmatches=True)
pickle.dump(hits, open(os.path.join(output_dir,'hits.pkl'), "wb"))
print(time.time() - t0)
print('%s%s' % (len(hits),' <- total number of MSMS spectra found in your files'))
In [ ]:
## BLOCK B (uncomment lines below to run this. Only use when all data files are MS1)
#hits=pd.DataFrame([], columns=['database','id','file_name','msms_scan','score','num_matches','msv_query_aligned','msv_ref_aligned','name','adduct','inchi_key','precursor_mz','measured_precursor_mz'])
#hits.set_index(['database','id','file_name','msms_scan'], inplace=True)
In [ ]:
# Optional: If you already have a pickled hits file and do not need to run get_msms_hits again, uncomment this block
# hits = pickle.load(open(os.path.join(output_dir,'hits.pkl'), "rb"))
This block creates an interactive plot. The top panel displays MSMS from within the two green RT bounds selected below (rt_min and rt_max, initially set in the atlas). When the database holds reference spectra, mirror plots are generated with the reference spectra inverted below the sample spectra. The lower panel displays the EICs overlaid for all of the files in your selected groups. You can highlight your groups in different colors. It is recommended that you do this at least for your extraction blank (or, if not available, a solvent injection blank). This plot also displays radio buttons that can be interactively selected; the values will be exported in your final identifications table and in your atlas export. Use these to mark peak/MSMS quality.
How to use:
A. Option A (custom EIC colors): set
colorlist = [['color1nameorhexadec','partialgroupstring1'], ['color2nameorhexadec','partialgroupstring2']]
- You can add more comma-delimited colors/groups as needed.
- These are partial strings that match to your file names (not your group names).
- The order they are listed is the order they are displayed in the overlays (first is front, last is back).
- Named colors available in matplotlib are listed here: https://matplotlib.org/3.1.0/gallery/color/named_colors.html or use hexadecimal values (e.g. '#000000').
B. Option B (default EIC colors): comment out the custom colorlist lines and uncomment the default colorlist = "". All EICs then default to black.
TIPS: use compound_idx = 0 in STEP 3 to change to a different compound in your atlas by index number. If your plot does not fit in your browser window, adjust the height and width values. Use alpha to change the transparency of the lines; it takes a value from 0 (transparent) to 1 (opaque).
DO NOT change your theoretical RT peak (the purple line). It is locked from editing (unless you change a hidden parameter) and should only be changed in special cases. The measured retention times of your peaks will be calculated and exported in your output files; these will be compared with the theoretical RTs and used in your evidence-of-identification table.
In [ ]:
###STEP 1: Set the peak flag radio buttons using one of the two lines below, for custom flags or default flags
peak_flag_list=('','L1+ - 1 pk, good RT&MSMS','L1+ - known isomer overlap','L1+ - 1 pk, good RT, MSMS ok (coisolated mz/partial match/low int)','L1+ - 1 pk, good RT&MSMS from external library','L1 - 1 pk, correct RT, no MSMS or int too low for matching','L1 - 1 pk, good RT, very low intensity/poor pk shape','L2 put comp','L3 putative class','Remove - background/noise','Remove - bad EMA MSMS','Remove - bad MSMS NIST/MONA/Metlin')
#peak_flag_list ="" # this will default to ('keep','remove','unresolvable isomers','poor peak shape')
###STEP 2: Set the EIC line colors using one of the two lines below, for custom colors or default
colorlist= [['red','peas'],
['green','beets']]
#colorlist="" # this will default to black
###STEP 3
a = dp.adjust_rt_for_selected_compound(metatlas_dataset, msms_hits=hits,peak_flags=peak_flag_list, color_me = colorlist, compound_idx=0,alpha=0.5,width=15,height=4.5)
In [ ]:
atlas_identifications = dp.export_atlas_to_spreadsheet(my_atlas,os.path.join(project_directory,output_subfolder,'%s%s.csv' % (my_atlas.name,"export")))
print(my_atlas.name)
This block creates a number of output files, including the stats tables and compound_scores.csv.
The kwargs below set the filtering cutoffs for the parameters indicated.
In [ ]:
kwargs = {'min_intensity': 1e4, # strict = 1e5, loose = 1e3
'rt_tolerance': .5, #>= shift of median RT across all files for given compound to reference
'mz_tolerance': 20, # strict = 5, loose = 25; >= ppm of median mz across all files for given compound relative to reference
'min_msms_score': .6, 'allow_no_msms': True, # strict = 0.6, loose = 0.3 <= highest compound dot-product score across all files for given compound relative to reference
'min_num_frag_matches': 1, 'min_relative_frag_intensity': .001} # strict = 3 and 0.1, loose = 1, 0.01 number of matching mzs when calculating max_msms_score and ratio of second highest to first highest intensity of matching sample mzs
scores_df = fa.make_scores_df(metatlas_dataset,hits)
scores_df['passing'] = fa.test_scores_df(scores_df, **kwargs)
pass_atlas_df, fail_atlas_df, pass_dataset, fail_dataset = fa.filter_atlas_and_dataset(scores_df, atlas_df, metatlas_dataset, column='passing')
fa.make_stats_table(input_dataset = metatlas_dataset, msms_hits = hits, output_loc = output_dir,min_peak_height=1e5,use_labels=True,min_msms_score=0.01,min_num_frag_matches=1,include_lcmsruns = [],exclude_lcmsruns = ['QC'])
scores_df.to_csv(os.path.join(output_dir, 'stats_tables','compound_scores.csv'))
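As an optional follow-up check (a quick sketch using standard pandas calls on the variables created above), you can see how many compounds passed the filters and preview those that failed:
In [ ]:
# Optional: summarize pass/fail counts from the scoring above
print(scores_df['passing'].value_counts())
fail_atlas_df.head()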
In [ ]:
group = 'index' # 'page' or 'index' or None
save = True
share_y = True
dp.make_chromatograms(input_dataset=metatlas_dataset, group=group, share_y=share_y, save=save, output_loc=output_dir)
In [ ]:
dp.make_identification_figure_v2(input_dataset = metatlas_dataset, msms_hits=hits, use_labels=True, include_lcmsruns = [],exclude_lcmsruns = ['InjBl','QC','Blank','blank'], output_loc=os.path.join(output_dir,'msms_mirror_plots'))
In [ ]:
peak_height = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='peak_height', output_loc=os.path.join(output_dir,'data_sheets'))
peak_area = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='peak_area', output_loc=os.path.join(output_dir,'data_sheets'))
mz_peak = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='mz_peak', output_loc=os.path.join(output_dir,'data_sheets'))
rt_peak = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [],fieldname='rt_peak', output_loc=os.path.join(output_dir,'data_sheets'))
mz_centroid = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='mz_centroid', output_loc=os.path.join(output_dir,'data_sheets'))
rt_centroid = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='rt_centroid', output_loc=os.path.join(output_dir,'data_sheets'))
dp.plot_errorbar_plots(peak_height, output_loc=os.path.join(output_dir,'error_bar_peak_height'))
dp.plot_errorbar_plots(rt_peak, output_loc=os.path.join(output_dir,'error_bar_rt_peak'))