author: stefan.m.janssen@gmail.com
date: 08 Aug 2016
language: Python 3.5.2 :: Continuum Analytics, Inc.
conda environment: micronota
license: unlicensed

Filter EMP

Currently the EMP comprises ~34k samples. Sometimes we need to operate on smaller sample numbers. Thus, the merged mapping file contains 'subset_xxx' columns, where xxx indicates the number of samples in the subset. These columns hold numpy booleans (np.True_ / np.False_) and flag whether a sample of the whole EMP is part of subset xxx.

There are several points where we need to filter full data-sets down to those smaller subsets, e.g. BIOM tables or beta diversity distance matrices. This notebook is an example of how to accomplish that.


In [1]:
# a trick to make this notebook run transparently, either on a barnacle node or on my machine with barnacle's file system mounted to /media/barnacle
import socket
ROOT = "/media/barnacle/" if socket.gethostname() == 't440s' else "/"

In [8]:
%matplotlib inline

# make all necessary imports
import pandas as pd
import sys
import os
import biom
import numpy as np

The next cell contains all parameters that might need to be changed.


In [9]:
# file_mapping points to the merged EMP mapping file, which contains columns like 'subset_2000'.
file_mapping = ROOT + '/projects/emp/00-qiime-maps/merged/emp_qiime_mapping_latest.tsv'

# BIOM Tables:
# file_biom points to the full BIOM table, i.e. the input file
#file_biom = ROOT + '/projects/emp/03-otus/01-closed-ref-greengenes/emp_cr_gg_13_8.biom'
file_biom = ROOT + '/projects/emp/03-otus/01-closed-ref-silva-16S/emp_cr_silva_16S.biom'

# filename of the resulting filtered BIOM table, i.e. the output file
file_biom_subset = "./filtered.biom"

# Beta Diversity Matrices:
# filename of the full beta diversity matrix, i.e. the input file
file_betaDiv = ROOT + '/projects/emp/06-beta/01-closed-ref-greengenes/emp_cr_gg_13_8.bdiv-unweighted-unifrac.result'
# filename of the resulting filtered beta diversity matrix, i.e. the output file
file_betaDiv_subset = "./filtered_betadiv.txt"

In [21]:
# Reads a BIOM table from 'sourceFile' and retains only those samples whose IDs are in the list 'sampleIDsToKeep'. The result is written to the file 'targetFile'.

def filterBiomTable(sampleIDsToKeep, sourceFile, targetFile):
    targetFile = os.path.abspath(targetFile)
    
    print("Loading source biom table ...", end="")
    counts = biom.load_table(sourceFile)
    
    #collect available sample IDs from input biom table (some IDs of the list 'sampleIDsToKeep' might not be in the input biom file, e.g. if we don't have sequence reads)
    counts_idx = counts.ids('sample')
    print(" which holds " + str(len(counts_idx)) + " samples.")

    toBeFiltered_idx = list(set(counts_idx) & set(sampleIDsToKeep))
    notFound = list(set(sampleIDsToKeep) - set(counts_idx))
    print("Sub-sampling down to " + str(len(toBeFiltered_idx)) + " of which " + str(len(notFound)) + " are not in the input BIOM table.")
    
    #actual filtering
    counts_subset = counts.filter(toBeFiltered_idx, axis='sample', inplace=False)
    
    #write the filtered table to 'targetFile'
    print("Writing resulting biom table to '"+ targetFile +"'")
    with biom.util.biom_open(targetFile, 'w') as f:
        counts_subset.to_hdf5(f, "EMP sub-set generation notebook")
        
    print("Generating summary.")
    inp = targetFile
    out = targetFile + ".summary.txt"
    !biom summarize-table -i "$inp" > "$out"

In [24]:
# Reads a (beta diversity) distance matrix from 'sourceFile' and retains only those cells that belong to samples listed in 'sampleIDsToKeep'. The output is written to 'targetFile'.

def filterBetadivMatrix(sampleIDsToKeep, sourceFile, targetFile):
    targetFile = os.path.abspath(targetFile)
    out = open(targetFile, "w")
    f = open(sourceFile, "r")
    columnsToKeep = []
    foundIdx = []
    lidx = set(sampleIDsToKeep)  # a set allows fast membership tests
    # the first line of the matrix file holds the sample IDs of all columns
    matrixIdx = f.readline().rstrip().split("\t")
    for i, sid in enumerate(matrixIdx):
        if sid in lidx:
            columnsToKeep.append(i)
            foundIdx.append(sid)
            out.write("\t" + sid)
    out.write("\n")
    
    notFound = list(set(sampleIDsToKeep) - set(foundIdx))
    print("Sub-sampling down to " + str(len(sampleIDsToKeep)) + " of which " + str(len(notFound)) + " are not in the input matrix.")
    
    for lineCount, line in enumerate(f):
        fields = line.rstrip().split("\t")
        if fields[0] in lidx:
            out.write(fields[0])
            for i in columnsToKeep:
                out.write("\t" + fields[i])
            out.write("\n")
        if lineCount % max(1, len(matrixIdx) // 10) == 0:  # report progress roughly every 10% of rows
            print("  processed " + str(int(lineCount / len(matrixIdx) * 100)) + " %")
            
    print("wrote resulting distance matrix to '" +targetFile+ "'")
    out.close()
    f.close()

In [17]:
#we load the mapping file into a Pandas dataframe
metadata = pd.read_csv(file_mapping, sep="\t", index_col=0, low_memory=False)

#we choose a (sub-)set of sample IDs that we want to have in our output set, e.g. all sample IDs that have np.True_ in the 'subset_2000' column, i.e. those 2000 samples that form the EMP 2000 sub-set
idx = metadata[metadata['subset_2000'] == np.True_].index

#another example: choose all samples that are annotated as 'Animal distal gut' in EMPO level 3
#idx = metadata[metadata.empo_3 == 'Animal distal gut'].index
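
To get an overview of which sub-sets the mapping file offers, the 'subset_*' columns can be listed together with the number of samples each of them flags; several criteria can also be combined. A minimal sketch, based on the 'subset_xxx' naming described above:

In [ ]:
# list all subset columns and how many samples each of them flags
subset_cols = [col for col in metadata.columns if col.startswith('subset_')]
for col in subset_cols:
    print(col, int((metadata[col] == np.True_).sum()))

# criteria can also be combined, e.g. all 'Animal distal gut' samples within the 2000 sample sub-set
#idx = metadata[(metadata['subset_2000'] == np.True_) & (metadata.empo_3 == 'Animal distal gut')].index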

In [22]:
filterBiomTable(idx, file_biom, file_biom_subset)


Loading source biom table ... which holds 27398 samples.
Sub-sampling down to 1999 samples; 1 of the requested sample IDs are not in the input BIOM table.
Writing resulting biom table to './filtered.biom'
Generating summary.
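
As a quick sanity check, the filtered table can be re-loaded to confirm that it holds the expected number of samples. A minimal sketch:

In [ ]:
# re-load the filtered BIOM table and report how many samples it contains
counts_check = biom.load_table(file_biom_subset)
print("filtered table holds " + str(len(counts_check.ids('sample'))) + " samples")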

It is also useful to filter the beta diversity matrices in the same way, since they are huge files for the full EMP data-set (~14 GB).


In [34]:
%time filterBetadivMatrix(idx, file_betaDiv, file_betaDiv_subset)


Requested 2000 sample IDs, of which 1 are not in the input matrix.
  processed 0 %
  processed 9 %
  processed 19 %
  processed 29 %
  processed 39 %
  processed 49 %
  processed 59 %
  processed 69 %
  processed 79 %
  processed 89 %
  processed 99 %
wrote resulting distance matrix to '/home/sjanssen/EMP/filtered_betadiv.txt'
CPU times: user 49.1 s, sys: 5.69 s, total: 54.8 s
Wall time: 55.3 s
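
The filtered distance matrix can then be read back for downstream analyses, e.g. principal coordinates analysis. A minimal sketch, assuming scikit-bio is available in this environment:

In [ ]:
# read the filtered distance matrix with scikit-bio and report its dimensions
from skbio import DistanceMatrix

dm = DistanceMatrix.read(file_betaDiv_subset)
print("filtered distance matrix covers " + str(dm.shape[0]) + " samples")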

In [ ]: