author: stefan.m.janssen@gmail.com
date: 08 Aug 2016
language: Python 3.5.2 :: Continuum Analytics, Inc.
conda environment: micronota
license: unlicensed
Currently the EMP comprises ~34k samples. Sometimes we need to operate on smaller sample numbers, so the merged mapping file contains 'subset_xxx' columns, where xxx indicates the number of samples in the subset. These columns hold boolean values flagging whether a sample of the whole EMP is part of subset xxx.
There are several points where we need to filter full data-sets down to those smaller subsets, e.g. BIOM tables or beta diversity distance matrices. This notebook is an example of how to accomplish that.
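As a minimal sketch of the idea (using a made-up three-sample mapping table, not the real EMP mapping file), a boolean 'subset_xxx' column selects its member sample IDs like this:

```python
import pandas as pd

# Toy mapping table mirroring the structure described above: sample IDs as
# index, plus a boolean 'subset_2000'-style column and an EMPO annotation.
metadata = pd.DataFrame(
    {"subset_2000": [True, False, True],
     "empo_3": ["Animal distal gut", "Water (saline)", "Soil (non-saline)"]},
    index=["S1", "S2", "S3"])

# boolean indexing keeps only the rows flagged True; .index yields the IDs
idx = metadata[metadata["subset_2000"]].index
print(list(idx))  # ['S1', 'S3']
```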
In [1]:
# a trick to make it transparent if executed on a barnacle node or on my machine with barnacles file system mounted to /media/barnacle
import socket
ROOT = "/media/barnacle/" if socket.gethostname() == 't440s' else "/"
In [8]:
%matplotlib inline
# make all necessary imports
import pandas as pd
import sys
import os
import biom
import numpy as np
The next cell contains all parameters that might need to be changed.
In [9]:
# file_mapping points to the merged EMP mapping file, which contains columns like 'subset_2000'.
file_mapping = ROOT + '/projects/emp/00-qiime-maps/merged/emp_qiime_mapping_latest.tsv'
# BIOM Tables:
# file_biom points to the full BIOM table, i.e. the input file
#file_biom = ROOT + '/projects/emp/03-otus/01-closed-ref-greengenes/emp_cr_gg_13_8.biom'
file_biom = ROOT + '/projects/emp/03-otus/01-closed-ref-silva-16S/emp_cr_silva_16S.biom'
# filename of the resulting filtered BIOM table, i.e. the output file
file_biom_subset = "./filtered.biom"
# Beta Diversity Matrices:
# filename of the full beta diversity matrix, i.e. the input file
file_betaDiv = ROOT + '/projects/emp/06-beta/01-closed-ref-greengenes/emp_cr_gg_13_8.bdiv-unweighted-unifrac.result'
# filename of the resulting filtered beta diversity matrix, i.e. the output file
file_betaDiv_subset = "./filtered_betadiv.txt"
In [21]:
# Reads a BIOM table from 'sourceFile', retains only those samples whose IDs
# are in 'sampleIDsToKeep', and writes the result to the file 'targetFile'.
def filterBiomTable(sampleIDsToKeep, sourceFile, targetFile):
    targetFile = os.path.abspath(targetFile)
    print("Loading source biom table ...", end="")
    counts = biom.load_table(sourceFile)
    # collect available sample IDs from the input BIOM table (some IDs in
    # 'sampleIDsToKeep' might not be in the input file, e.g. if we don't
    # have sequence reads for them)
    counts_idx = counts.ids('sample')
    print(" which holds " + str(len(counts_idx)) + " samples.")
    toBeFiltered_idx = list(set(counts_idx) & set(sampleIDsToKeep))
    notFound = list(set(sampleIDsToKeep) - set(counts_idx))
    print("Sub-sampling down to " + str(len(toBeFiltered_idx)) + " samples; " +
          str(len(notFound)) + " requested IDs are not in the input BIOM table.")
    # actual filtering
    counts_subset = counts.filter(toBeFiltered_idx, axis='sample', inplace=False)
    # write the filtered table as HDF5
    print("Writing resulting biom table to '" + targetFile + "'")
    with biom.util.biom_open(targetFile, 'w') as f:
        counts_subset.to_hdf5(f, "EMP sub-set generation notebook")
    print("Generating summary.")
    inp = targetFile
    out = targetFile + ".summary.txt"
    !biom summarize-table -i "$inp" > "$out"
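The ID bookkeeping above (intersect the requested IDs with those actually present, report the rest as missing) can be illustrated with plain Python sets; the IDs here are made up for demonstration:

```python
# IDs present in a (hypothetical) BIOM table vs. IDs requested via the mapping file
table_ids = ["S1", "S2", "S3", "S4"]
wanted = ["S2", "S4", "S9"]

# keep only requested IDs that exist in the table ...
to_keep = sorted(set(table_ids) & set(wanted))
# ... and record requested IDs with no data
not_found = sorted(set(wanted) - set(table_ids))

print(to_keep)    # ['S2', 'S4']
print(not_found)  # ['S9']
```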
In [24]:
# Reads a (beta diversity) distance matrix from 'sourceFile' and keeps only
# those rows/columns that belong to samples listed in 'sampleIDsToKeep'.
# The filtered matrix is written to 'targetFile'.
def filterBetadivMatrix(sampleIDsToKeep, sourceFile, targetFile):
    targetFile = os.path.abspath(targetFile)
    out = open(targetFile, "w")
    f = open(sourceFile, "r")
    columnsToKeep = []
    foundIdx = []
    lidx = set(sampleIDsToKeep)  # a set gives O(1) membership tests
    # the header line holds all sample IDs of the full matrix
    matrixIdx = f.readline().rstrip().split("\t")
    for i, sid in enumerate(matrixIdx):
        if sid in lidx:
            columnsToKeep.append(i)
            foundIdx.append(sid)
            out.write("\t" + sid)
    out.write("\n")
    notFound = list(lidx - set(foundIdx))
    print("Sub-sampling down to " + str(len(foundIdx)) + " samples; " +
          str(len(notFound)) + " requested IDs are not in the input matrix.")
    # report progress roughly every 10% of rows (guard against tiny matrices)
    progressStep = max(1, int(len(matrixIdx) / 10))
    for lineCount, line in enumerate(f):
        fields = line.rstrip().split("\t")
        if fields[0] in lidx:
            out.write(fields[0])
            for i in columnsToKeep:
                out.write("\t" + fields[i])
            out.write("\n")
        if lineCount % progressStep == 0:
            print(" processed " + str(int(lineCount / len(matrixIdx) * 100)) + " %")
    print("wrote resulting distance matrix to '" + targetFile + "'")
    out.close()
    f.close()
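The streaming approach above avoids loading the full ~14 GB matrix into memory. For matrices that do fit in memory, the same symmetric row/column selection can be sketched with pandas (toy data, made-up sample IDs):

```python
import pandas as pd

# A 4x4 toy distance matrix in the same layout the function expects:
# sample IDs as both index and columns.
ids = ["S1", "S2", "S3", "S4"]
dm = pd.DataFrame(
    [[0.0, 0.1, 0.2, 0.3],
     [0.1, 0.0, 0.4, 0.5],
     [0.2, 0.4, 0.0, 0.6],
     [0.3, 0.5, 0.6, 0.0]],
    index=ids, columns=ids)

# drop requested IDs that are not in the matrix, then select rows and columns
keep = [i for i in ["S2", "S4", "S9"] if i in dm.index]
sub = dm.loc[keep, keep]
print(sub.shape)  # (2, 2)
```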
In [17]:
# load the mapping file into a pandas DataFrame
metadata = pd.read_csv(file_mapping, sep="\t", index_col=0, low_memory=False)
# choose the (sub-)set of sample IDs we want in our output, e.g. all sample IDs
# flagged np.True_ in the 'subset_2000' column, i.e. the 2000 samples that form
# the EMP 2000 sub-set
idx = metadata[metadata['subset_2000'] == np.True_].index
# another example: choose all samples annotated as 'Animal distal gut' in EMPO level 3
#idx = metadata[metadata.empo_3 == 'Animal distal gut'].index
In [22]:
filterBiomTable(idx, file_biom, file_biom_subset)
It is also useful to filter the beta diversity matrices in the same way, since they are huge for the full EMP data-set (~14 GB).
In [34]:
%time filterBetadivMatrix(idx, file_betaDiv, file_betaDiv_subset)