Computational phospho-proteomic network inference pipeline

The script cells below were used to analyze wild-type and mutant phosphoproteomic data from MacGilvray et al. (bioRxiv 2017, doi:10.1101/176230). The pipeline takes a list of phospho-peptides from S. cerevisiae and defines groups of likely co-regulated peptides that share the same phosphorylation motif (see the manuscript Methods for details). For each of these 'modules' of peptides, the pipeline then identifies 'shared interactors', defined as proteins from a background network of protein interactions that show more physical interactions with module constituent proteins than expected by chance. Shared interactor-module pairs serve as inputs for a previously developed Integer Programming (IP) approach that connects the sources to their downstream target submodules (Chasman et al., 2014).
Please see our bioRxiv preprint for additional information:

Network inference reveals novel connections in pathways regulating growth and defense in the yeast salt response. Matthew E. MacGilvray+, Evgenia Shishkova+, Deborah Chasman, Michael Place, Anthony Gitter, Joshua J. Coon, Audrey P. Gasch. bioRxiv 2017. doi:10.1101/176230

The user should define differentially changing phospho-peptides in the "WT" or "Parent" strain using their own criteria (e.g., fold-change, p-value), then group/cluster phospho-peptides based on similar directionality of abundance change.
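For example, such a filter can be sketched with pandas; the column names (log2FC, pvalue) and cutoffs below are placeholders for whatever the user's quantification produces, not part of the pipeline:

```python
import pandas as pd

# Hypothetical quantification table: one row per phospho-peptide.
# Column names and cutoffs are illustrative only.
df = pd.DataFrame({
    'Ppep':   ['YAL035W_T388', 'YBR059C_S123', 'YGR240C_S895'],
    'log2FC': [1.8, -2.1, 0.2],
    'pvalue': [0.001, 0.004, 0.40],
})

# Keep peptides changing >1.5-fold (|log2FC| > ~0.58) at p < 0.05,
# then label each peptide by the direction of its abundance change.
sig = df[(df['log2FC'].abs() > 0.58) & (df['pvalue'] < 0.05)].copy()
sig['Group'] = sig['log2FC'].apply(lambda fc: 'Induced' if fc > 0 else 'Repressed')
```

The resulting Group labels (Induced/Repressed) feed into the clustering step described above.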

Files required (provided in the Git repository)

Link or Copy reference/, required/ and Mok_kinase_PWMs/ directories into your working directory.

  1. reference/orf_trans_all.20150113.fasta
  2. reference/bgNtWk.csv
  3. reference/Background_Network.csv
  4. Mok_kinase_PWMs/Mok_KinasePWM*.csv
  5. required/Kinase_Groups.csv
  6. required/kinase_phosphatase_yeast.csv

User supplied files

  1. idModules.csv file, for the Identify Modules and Submodules step.
  2. Submodules.txt, which the user may provide if the Identify Modules and Submodules step is skipped.

Output files

  1. Motif-x -- single file and 3 directories which contain LOGO PNGs and the original Motif-x results webpage.
  2. Submodules.txt
  3. Shared_interactors.csv -- list of all shared interactors w/ p-value and p-adjust values.
  4. position_weight_matrix.txt -- use w/ Kullback-Leibler script.
  5. 2 directories for the Kullback-Leibler (unshuffled, shuffled)
  6. Shared_interactors_Kinase_FDR.csv -- Kullback-Leibler kinase submodules w/ FDR score.
  7. Network.sif

Identify motifs

This automates submitting jobs to the Motif-x Website (http://motif-x.med.harvard.edu/)

Input: a single text file (called inputfiles in the next cell) listing the text (csv) files to process, one file name per line:

data_sheet1.csv
data_sheet2.csv

File format, comma separated:

Ppep,Group,Localized_Sequence,Motif_X_Input_Peptide
YAL035W_T388,Induced,AAEKILtPEsQLKK,AAEKILTPESQLKK

Column order is unimportant, but the column names must match those above.
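Before submitting, the required columns can be verified with a short helper (check_columns is not part of the pipeline, just a sketch):

```python
import csv

# Columns each Motif-x input file must contain (in any order).
REQUIRED = {'Ppep', 'Group', 'Localized_Sequence', 'Motif_X_Input_Peptide'}

def check_columns(path):
    """Return the set of required column names missing from a csv file's header."""
    with open(path) as fh:
        header = next(csv.reader(fh))
    return REQUIRED - set(header)
```

An empty set means the file header is acceptable.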

Output:

A text table named after the input file containing all the motifs matched to a gene.

Given an input file named motifx_sample.xlsx, the final results file will be:

motifx_sample-Motifx-results.txt

The rest of the results are placed in directories. For instance, if your input file is called motifx_sample.xlsx, 3 directories will be created, one for each central character. These contain the LOGO PNGs and the original HTML results page.

motifx_sample_T
motifx_sample_S
motifx_sample_Y

Enter the current working directory in the next cell, which must be run to load the python libraries and set the working directory.


In [ ]:
# required python libraries
import Bio
import bisect
import errno
import glob
import itertools
import math
import os
import numpy as np
import pandas as pd
import random
import re
import shutil
from scipy.stats import hypergeom
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import motifs
from Bio.Alphabet import IUPAC
from rpy2.robjects.packages import importr

import yeast_Gene_name_to_ORF                 # for SGD Systematic Name to look-up Standard Name

# ENTER WORKING DIRECTORY 
current_dir = '/home/mplace/projects/forMatt/Phospho_Network/forPaper'
os.chdir(current_dir)
print('Working in directory: %s' % os.getcwd())

In [ ]:
%run -i 'Motifx.py' -f 'inputfiles' -u 'reference/orf_trans_all.20150113.fasta'

### WHEN COMPLETE SEE WORKING DIRECTORY FOR YOUR MOTIF-X RESULTS
### CHECK LOG.txt for errors

Identify Modules and Submodules step.

This script identifies co-regulated groups of phospho-peptides using the following approach:

1) Identifies 'modules', which are groups of phospho-peptides that exhibit the same directionality of stress-dependent abundance change (i.e., increased 'Induced' or decreased 'Repressed') and the same motif. The module nomenclature is Induced/Repressed-motif (e.g., Induced..RK.s....).

2) Partitions modules into 'submodules' based on their phospho-peptide constituents' dependency on a protein (or proteins) for stress-dependent abundance changes (i.e., phospho-peptides that exhibit increased 'amplified' or decreased 'defective' abundance in a deletion strain compared to the 'WT' or 'Parental' strain). These phenotypes are user defined. If two or more mutant phenotypes are recorded for a phospho-peptide, it is placed into a separate submodule for each mutant phenotype. If there is no mutant phenotype at the user-defined threshold, the phenotype is 'No-Phenotype'.

The submodule nomenclature is module name-mutant phenotype/No-Phenotype (e.g., Induced..RK.s....Mutant_Defective).

Possible submodule phenotypes: Induced-Defective, Induced-Amplified, Repressed-Defective, Repressed-Amplified, Induced-No-Phenotype, Repressed-No-Phenotype

The Motif-x results need to be classified by the user. Example Motif-x output:

Ppep Cluster Motifx Peptide
YAL035W_T388 Name ......TP..... AATPAATPTPSSA

The user must add the grouping. Here is an example of the idModules.csv file used in the next step.

Ppep,Cluster,Motif,Peptide,Mutant1,Mutant2
YAL035W_T388,Induced,......SP.....,NKKNEASPNTDAK,Induced,Repressed

  1. Mutant1 and Mutant2 correspond to mutant gene names.
  2. Induced/Repressed are groupings provided by the user.

The resulting output file will have Submodules identified at the protein level.

Skip to the Identify Shared Interactors step if you already have a "Submodule,ORF" file and no mutant strain data.

Enter the Identify Modules & Submodules input file name in the next cell.


In [ ]:
# INPUT FILE NAME 
inputData = 'idModules.csv'


# identifies 'modules', which are groups of phospho-peptides 
modData = set() # hold a unique list of Submodule,ORF data
ct      = 4     # position holder for input data list, after this position the mutant strain names begin
                # for example: "Ppep,Cluster,Motif,Peptide,gene1,gene2", gene1 & gene2 are used as strain names  

# open file to create Submodule,ORF list
with open(inputData) as file:
    columns = file.readline().rstrip().split(',')     # grab header, ex. "Ppep,Cluster,Motif,Peptide,ire1,mkk1_2"
                                                      # all columns after Peptide are considered mutant strain names
    numStrains = len(columns[ct:])                    # count number of mutant strains, all items after Peptide in header
    
    for line in file:
        row  = line.rstrip().split(',')     
        gene = row[0].split('_')[0]              # Get gene name from first column, YGR240C_S895 becomes YGR240C
        if row[2] == '':                         # skip rows w/ no motif
            continue
            
        countBlanks = row[ct:].count('')         # count number of missing mutant phenotypes 
        if countBlanks < numStrains:             # if at least one phenotype
            for idx, val in enumerate(row[ct:]): # loop through each phenotype
                name = row[0].split('_')[0]      # get gene name
                if val == '':                    # if phenotype is missing skip it
                    continue
                else:
                    subModule = row[1] + '_' + row[2]  + '_' + columns[ct + idx] + '_' + val + ',' + name + '\n'
                    modData.add(subModule)
        else: # here we have a motif but no strain phenotype
            name = row[0].split('_')[0]
            subModule = row[1] + '_' + row[2]  + '_' + 'No_Phenotype_Exists' + ',' + name + '\n'
            modData.add(subModule)

# Items in list should be unique
# write the Submodule constituent file
with open('Submodules.txt', 'w') as out:
    out.write('Submodule,ORF\n')
    for sb in modData:
        out.write(sb)
print('ID submodules Complete')

Identify Shared Interactors

Identify proteins enriched for interactions with Submodule constituent proteins, based on known interactions in the background network. We call these proteins 'Shared Interactors'. The background network is a protein interaction network curated in yeast under mostly nutrient-replete conditions; it contains 4638 proteins and ~25,000 interactions, including directed (e.g., kinase-substrate) and non-directed interactions.

Proteins enriched for interactions with Submodule proteins at a 5% FDR, determined by a hypergeometric test and BH correction, are considered shared interactors.

Shared Interactors represent numerous functional classes, including kinases and phosphatases. Kinase and phosphatase shared interactors represent potential Submodule regulators.

HyperG function: distrib = hypergeom(N, M, n); distrib.pmf(m)

  • N - population size (4638 unique proteins in the background network file, reference/bgNtWk.csv)

  • M - total number of successes in the population (# of interactions for a given protein, e.g. Protein A has 200 known interactions in the background network)

  • n - the number of trials (sample size), i.e. the number of proteins that reside within a submodule

  • m - the number of successes, e.g. Protein A, a shared interactor, has 35 interactions with proteins in Submodule B
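Note that scipy's own argument order differs from the notation above; a small sketch with made-up numbers showing the mapping used later in this notebook:

```python
from scipy.stats import hypergeom

# Document notation: N = population, M = interactions of the candidate
# protein, n = submodule size, m = observed interactions with the submodule.
# scipy notation:    hypergeom.sf(k, M, n, N), where M = population size,
#                    n = successes in the population, N = number of draws.
N, M, n, m = 4638, 200, 11, 5          # illustrative values, not real data
pval = hypergeom.sf(m - 1, N, M, n)    # P(X >= m), the enrichment p-value
```

Passing m-1 to the survival function gives the probability of observing m or more interactions by chance.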

The resulting shared interactor p-values are scored for significance using the Benjamini-Hochberg procedure. See the original paper for more information.

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
Yoav Benjamini and Yosef Hochberg
Journal of the Royal Statistical Society. Series B (Methodological)
Vol. 57, No. 1 (1995), pp. 289-300
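A minimal sketch of the BH step as applied here, with made-up p-values: rank the p-values, compare each to (rank/numTests)*Q, and call everything up to the largest passing rank significant.

```python
# Illustrative shared-interactor p-values only (keys are placeholders).
Q = 0.05
pvals = {'si_a': 0.001, 'si_b': 0.012, 'si_c': 0.030, 'si_d': 0.200}

ordered  = sorted(pvals, key=pvals.get)      # ascending by p-value
numTests = len(ordered)
maxRank  = -1
for i, k in enumerate(ordered):
    if pvals[k] < ((i + 1) / numTests) * Q:
        maxRank = i                          # last rank passing the BH threshold
significant = set(ordered[:maxRank + 1])     # everything at or below that rank
```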

A file containing all shared interactors (Shared_interactors.csv) can be found in the current working directory.
Columns:

Submodule         - Name, i.e. Induced_......SP....._mkk1_2_Induced_Defective
Shared_Interactor - Protein name, i.e. YBL079W
n                 - Sample size, hypergeometric test
N                 - Population size, hypergeometric test
M                 - Number of successes in population, hypergeometric test
m                 - Number of successes, hypergeometric test
p-value           - hypergeometric p-value
Interaction       - interaction type
Direction         - input, output, none
Standard_name     - nup170 for the above example of YBL079W
(i/n)*Q           - value returned by BH test
significance      - 1 = significant, 0 = not significant


Shared_interactors.csv is not filtered for BH significance.


In [ ]:
# load the background network into data structure
bgNtwk = 'reference/bgNtWk.csv'

# load background network w/ directionality
bgNtwk_directions = 'reference/Background_Network.csv'

# Population Size for Hypergeometric test
# Change if required
N = 4638                

# Significance test cutoff value Q
Q = 0.05

# CHANGE OUTPUT FILE NAME HERE IF DESIRED
outFile = open('Shared_interactors.csv', 'w')

# use a dict of dicts: key = protein name, e.g. YGL120C; value = dict mapping
# each interacting protein to its interaction type, for example:
# ntwk['YDR174W'] = {'YNL135C': 'ppi', 'YPR104C': 'ppi', 'YER148W': 'ppi',
#                    'YGR274C': 'ppi', 'YJL074C': 'ppi', 'YML015C': 'ppi'}
ntwk = {}

with open(bgNtwk, 'r') as file:
    file.readline()                                     # skip header
    for line in file:
        dat = line.rstrip().split(',')                  # line looks like: YGR240C,YJL039C,ppi,0
        ntwk.setdefault(dat[0], {})[dat[1]] = dat[2]    # add 2nd protein under 1st
        ntwk.setdefault(dat[1], {})[dat[0]] = dat[2]    # add 1st protein under 2nd

direction = {}                     # key = YJL187C_YDR054C value = ['kinase_substrate', 'ubiquitination:Reversed']
with open(bgNtwk_directions, 'r') as f:
    f.readline()                                         # skip header
    for line in f:
        dat = line.rstrip().split(',')
        interaction = dat[0] + '_' + dat[1]              # join protein names to make key
        direction.setdefault(interaction, []).append(dat[2])   # add interaction type

# Submodule constituent information is in modData from previous step
# collect the Genes & counts for proteins in a submodule 
# an example submodule looks like: Repressed_...R..S......_ire1_Repressed_Defective
motifCt = {}  # key = submodule, value = dict w/ 'count' = num of members, 'proteins' = list of proteins in submodule group

for i in modData:
    dat = i.split('_')                    
    gene = dat[-1].rstrip().split(',')[1] # split to get gene name
    sbMds = i.rstrip().split(',')         # get submodule name "Induced_...K..SP....._ire1_Induced_Amplified"
    if sbMds[0] not in motifCt:           # add submodule name if not in dict
        motifCt[sbMds[0]] = {}
        motifCt[sbMds[0]]['count'] = 1    # this is "n" for the hypergeometric test
        motifCt[sbMds[0]]['proteins'] = []
        motifCt[sbMds[0]]['proteins'].append(gene)
    else:
        motifCt[sbMds[0]]['count'] += 1              # count the number of submodule members
        motifCt[sbMds[0]]['proteins'].append(gene)   # gene name which participates in submodule
        
# now get the background protein interactors for each ORF/GENE in each submodule group
# For instance
# group: Repressed_......TP....._mkk1_2_Repressed_Defective has 2 ORFs in the group: YGR202C,YLR190W
# take each one and find the background protein interactors
# print submodule information
# example:
# Induced_...R..S......_mkk1_2_Induced_Defective 11
# YNL183C,YGL049C,YDL022W,YBL058W,YHR097C,YOR220W,YPL085W,YFL007W,YDR135C,YOL087C,YJR005W  

outFile.write( 'Submodule,Shared_Interactor,n,N,M,m,p-value,Interaction,Direction,Standard_name,(i/n)*Q,significance\n')   # write file header

testList  = []                                                      # list of lists , holds pvalue results
sigPValue = {}                                                      # store p-value for significance test
sigScore  = {}                                                      # significance dict used to implement 
                                                                    # threshold cutoff (i/m)*Q < p-value
                                                                    # key = SubModule + Gene Name ,value = significance Score
        
# loop through all items in motifCt[submodule] items
for subMod, val in motifCt.items():
    n           = motifCt[subMod]['count']                           # number of submodule members
    match       = set()                                              # list of interacting proteins w/ in a submodule
    count_m     = {}                                                 # count the occurrences for all proteins in subMod
    interaction = {}                                                 # dict of sets stores interaction type for Shared Interactor
    for prot1, prot2 in itertools.combinations(val['proteins'],2):   # all-against-all comparisons
        proteins = set()                                             # reset each pair to avoid reusing stale results
        if prot1 in ntwk and prot2 in ntwk:
            proteins = ntwk[prot1].keys() & ntwk[prot2].keys()       # set intersection of interactors
        if proteins:
            for p in proteins:                                       # make a unique count of proteins
                match.add(p)
                if prot1 not in count_m:
                    count_m[prot1] = set()
                    count_m[prot1].add(p)
                else:
                    count_m[prot1].add(p)
                # Find the interaction types for a subModule protein and its constituents
                # the dict of sets stores the unique interaction types for each constituent
                if prot1+'_'+p in direction:                         # if 1st protein_consituent has a direction
                    if p not in interaction:
                        interaction[p] = set()
                        interaction[p].add(''.join(direction[prot1+'_'+p]))
                    else:
                        interaction[p].add(''.join(direction[prot1+'_'+p]))

                if prot2+'_'+p in direction:                         # 2nd protein_constituent 
                    if p not in interaction:
                        interaction[p] = set()
                        interaction[p].add(''.join(direction[prot2+'_'+p]))
                    else:
                        interaction[p].add(''.join(direction[prot2+'_'+p]))
                                      
    for i in match:
        # find interaction type
        if len(interaction[i]) > 1:                                     # multiple interaction types
            iType = 'None'
        else:
            iType = ' '.join(interaction[i])                            #  one interaction
            
        if 'Reversed' in iType:                                         # Classify input/output direction
            directionality = 'output'
        elif iType == 'None':                                           # multiple interaction types
            directionality = 'None'
        else:
            directionality = 'input'                                    # ppi, kinase_substrate, 
        
        m = 0                                                           # num of submodule members sharing interactor i
        M = len(ntwk[i])                                                # num of interactions for i in background network
        for k in count_m.keys():                                        # count submodule members that share interactor i
            if i in count_m[k]:
                m += 1
        pvalue = hypergeom.sf(m-1,N,M,n)                                # call hypergeometric test
        testList.append([subMod,i,str(n),str(N),str(M), str(m), str(pvalue), iType, directionality])
        if subMod + '_' + i not in sigPValue:
            sigPValue[subMod + '_' + i] = pvalue
        else:
            print('Duplicates in sigPValue')
            
    match = set()

# Significance test procedure: (rank/numTests) * Q, Q = 0.05 cutoff by default
numTests = len(sigPValue)          
rank     = 1.0                                                 # rank value
maxRank  = 0                                                   # max rank position for the BH cutoff
idx      = 0
for k in sorted(sigPValue, key=lambda k: sigPValue[k]):        # sorts sigPValue by value returning the keys 
    test = (rank/numTests) * Q  
    rank += 1.0                         
    sigScore[k] = test
    
    # if p-value is less than (i/numTest)*Q set record position, used later to report BH significance
    if float(sigPValue[k]) < test:
        maxRank = idx
    idx += 1

# get a set of the names where BH score is significant
count   = 0
rankSet = set()
for k in sorted(sigPValue, key=lambda k: sigPValue[k]):
    if count <= maxRank:
        rankSet.add(k)
        count += 1
    else:
        break
        
for i in testList:                                   # write results to file
    tmpName = i[1]                                   # for some reason the bgntwk has all of the dashes removed
    oldName = list(i[1])                             # here we put them back for name look-up  
    sigLookup = i[0] + '_' + tmpName                 # for sig dict value lookup
    
    if oldName[-1] in ['A','B','D']:
        endChar = oldName.pop()
        nwName  = ''.join(oldName) + '-' + endChar
    elif tmpName.endswith('CC'):
        endChar = oldName.pop()
        nwName  = ''.join(oldName) + '-' + endChar 
    elif tmpName.endswith('WC'):
        endChar = oldName.pop()
        nwName  = ''.join(oldName) + '-' + endChar        
    else:
        nwName = i[1]       
    
    if sigLookup in rankSet:
        significance = '1'
    else:
        significance = '0'
        
    outFile.write('%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n'  %(i[0], i[1], i[2], i[3], i[4],\
                                                       i[5], i[6], i[7], i[8], yeast_Gene_name_to_ORF.sc_orfToGene[nwName],\
                                                       str(sigScore[sigLookup]),significance ) )

outFile.close()
print('Identify Shared Interactors Complete')

Preparation for Kullback-Leibler Script

Input: the idModules.csv file is parsed to produce an individual FASTA file for each Module, as shown:

Module Name Sequence
Induced_......SP..... YJL082W_S187 KNSSSPSPSEKSQ

All peptide sequences should be the same length (13 amino acids).

Module constituents are used here, not submodules. FASTA files are placed in a directory called fasta/.

The output FASTA files are named with their module designation.
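As a sanity check on the PWM step below, the per-position frequencies with a +1 pseudocount can be computed by hand. A minimal sketch (pwm_column is a hypothetical helper, intended to mirror what counts.normalize(pseudocounts = 1) does in Biopython):

```python
from collections import Counter

def pwm_column(residues, alphabet='ACDEFGHIKLMNPQRSTVWY', pseudocount=1):
    """Frequency of each amino acid at one peptide position, with a
    pseudocount added to every letter before normalizing."""
    counts = Counter(residues)
    total  = len(residues) + pseudocount * len(alphabet)
    return {aa: (counts[aa] + pseudocount) / total for aa in alphabet}

# Residues observed at one position across three module peptides.
col = pwm_column(['S', 'S', 'T'])
```

With 3 observations and 20 letters, each frequency is (count + 1) / 23, and the column sums to 1.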


In [ ]:
fasta = {}                                                   # dictionary, key = Module, value is ">Name\nSequence"

with open(inputData, 'r') as f:                              # open idModules.csv file
    f.readline()
    for line in f:
        dat = line.rstrip().split(',')
        if dat[3] == '':                                     # if sequence missing skip
            continue
        name = dat[1] + '_' + dat[2]                         # create Module name to use as key 
        if name not in fasta:
            fasta[name] = []
            fasta[name].append('>' + dat[0] + '\n' + dat[3] + '\n') # Name plus Sequence
        else:
            fasta[name].append('>' + dat[0] + '\n' + dat[3] + '\n')
f.close()

# if fasta directory does not exist, create it.
if not os.path.exists( current_dir + '/fasta'):
    try:
        os.makedirs('fasta')
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise

# write individual fasta files
for k,v in fasta.items():
    with open('fasta/' + k + '.fasta', 'w') as out:
        for i in v:
            out.write(i)

# Remove duplicate sequences from each fasta file
for fastaFile in glob.glob('fasta/*'):                       # avoid shadowing the fasta dict above
    clean = {}
    for seq_record in SeqIO.parse(fastaFile, 'fasta'):       # create Seq objects
        s = str(seq_record.seq)
        if s not in clean:
            clean[s] = seq_record                            # only keep unique sequences

    with open('tmp.fasta', 'w') as out_handle:               # write unique sequences to tmp file
        for k, v in clean.items():
            SeqIO.write(v, out_handle, 'fasta')

    shutil.move('tmp.fasta', fastaFile)                      # overwrite original fasta file

# OUTPUT: fasta/*.fasta 

# list of amino acids used to print the position weight matrix
AminoList = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y' ]
# column numbers for printing pwm, length of peptide, assumed to be 13, if different change last value
pep_Header = ','.join([str(i) for i in range(0,13)])     

# creates Position weight matrix includes all Modules 
instances = []
with open('position_weight_matrix.txt', 'w') as out:
    out.write('Motif,AA,%s\n' %(pep_Header))                   
    for x in os.listdir('fasta/'):                            # process all fasta files in the directory
        if x.endswith('.fasta'):
            with open('fasta/' + x, "r") as f:
                for line in f:
                    if line.startswith('>'):                                       
                        continue
                    line = line.rstrip()                                                 
                    instances.append(Seq(line, IUPAC.protein))# add amino acid sequence to instances
                m = motifs.create(instances)
                pwm = m.counts.normalize(pseudocounts = 1)    # Add a +1 pseudocount
                instances = []
                name = re.sub('.fasta', '', x)                # use file name for 1st column          
                for aa in AminoList :
                    score = [ str(i) for i in pwm[aa]]
                    score = ','.join(score)
                    out.write('%s,%s:,%s\n' %(name,aa,score))
# OUTPUT: position_weight_matrix.txt

Run Kullback-Leibler Module to Each Kinase

To quantify similarity between the Mok et al. kinase PWMs and the module PWMs, the script employs a previously described quantitative motif comparison method called Kullback-Leibler divergence (KLD) (Thijs et al., 2002, Gupta et al., 2007). KLD generates a similarity measure by comparing the Kullback-Leibler distance, or information content, for each amino acid at each position between a query and a comparison PWM. The more alike two PWMs are, the closer the score is to zero.

KLD(X,Y) = 1/2 ( Σa Xa·log(Xa/Ya) + Σa Ya·log(Ya/Xa) )

where 'X' represents a query PWM position and 'Y' a comparison PWM position, Xa indicates the probability of a given amino acid a ∈ A in X, and 'A' is the motif alphabet of the 20 naturally occurring amino acids; the sums run over all amino acids in A.
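A direct per-position implementation of the formula above looks like the following sketch (sym_kld is a hypothetical helper; the pseudocounts added during PWM construction keep every probability non-zero, so the logs are defined):

```python
import numpy as np

def sym_kld(x, y):
    """Symmetric Kullback-Leibler divergence between two probability
    vectors (e.g. one PWM column each); identical columns score 0."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 0.5 * (np.sum(x * np.log(x / y)) + np.sum(y * np.log(y / x)))
```

Summing sym_kld over the aligned positions of two PWMs gives a module-to-kinase score.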

Input: A plain text .csv file that contains all module position weight matrices. Each module PWM should have 20 rows, representing each of the 20 naturally occurring amino acids. They are in a column called "AA" which stands for amino acid. There should also be 13 columns, labeled 0-12 (representing the 13 amino acid sequence length of the phospho-peptides used to build the position weight matrix) that contain the frequency of each amino acid at each position.

Csv file format:

Motif,AA,0,1,2,3,4,5,6,7,8,9,10,11,12
Induced_...sP.,P:,0.05,0.05,0.03,0.05,0.05,0.03,0.05,0.05,0.03,0.05,0.05,0.03

A directory containing the Mok et al. kinase PWMs is required. They have the identical format as above and have been pre-generated; they are available for download on GitHub in the directory titled "Mok_kinase_PWMs". Please specify the full path to the Mok directory.

Kullback-Leibler is run using kullback-Leibler.py:

usage: kullback-Leibler.py -f position_weight_matrix.txt

Shuffle kullback-Leibler results for use w/ FDR function.

optional arguments:
-h, --help show this help message and exit
-f , --file position_weight_matrix.txt file
-m , --mokdir Full path to Mok_kinase_PWMs directory
-i , --iterations Total number of iterations, this will be divided by number of processes.
-o , --out Output directory
-p , --processes Number of processes to run, be smart don't use more than you have!

Kullback-Leibler is first run w/ just a single iteration (default) this will not shuffle the data.
Input data will be shuffled for any number of iterations greater than 1


In [ ]:
# run KL test on original data, output is put in 'Kullback-Leibler' directory, unless you change it.
%run -i 'kullback-Leibler.py' -f 'position_weight_matrix.txt' -m '/home/mplace/projects/forMatt/Phospho_Network/forPaper/Mok_kinase_PWMs/'

Kullback-Leibler Module to Each Kinase Shuffled 1000x

Purpose: The same algorithm as in the last step is used, but 1000 shuffles of the Mok kinase PWMs are performed by the script, generating randomized PWMs that are compared against the Module PWMs to produce a distribution of scores.

Output: A directory containing plain text .csv files named after each module. Within the .csv files are 63,000 KLD scores representing how well the 63 Mok et al. kinases match the module motif after 1000 permutations of each Mok kinase.

Change the number of iterations(-i) and processes (-p) in next cell.

Iterations should be evenly divisible by the number of processes.

Remember to change the output directory (-o).

This step is time intensive, often overnight if using a single process.

Use the -p flag to increase the number of cores/processes to use for the job.

Be smart: use fewer cores/processes than you have.


In [ ]:
# run KL test on original data, output is put in 'Kullback-Leibler' directory
%run -i 'kullback-Leibler.py' -f 'position_weight_matrix.txt' -m '/home/mplace/projects/forMatt/Phospho_Network/forPaper/Mok_kinase_PWMs/' -i 1000 -p 4 -o '/home/mplace/projects/forMatt/Phospho_Network/forPaper/KL-shuffle/'

Calculate FDR for Each Kinase in a Module using the Kullback-Leibler results and write the Network.sif file.

Identify FDR scores for each Mok et al. kinase and each module by comparing the non-shuffled scores to the distribution of shuffled scores. The user can then manually define the FDR cutoff to call kinases "motif-match" or "non-match" for a given module.

Input: Two directories and a single plain text .csv file, Kinase_Groups.csv, provided in the git repository. The first directory contains plain text .csv files with KLD scores for non-shuffled Mok et al. kinases and Modules. The second directory contains plain text .csv files with KLD scores for shuffled Mok et al. kinases and Modules.

Csv format (For both Input Directories), Module name is taken from file name.

Scores Kinase
13.25 cdc15

Output: A table that contains, for each module, all yeast kinases, including those found in the Mok et al. dataset and those that were absent, and their FDR scores. Kinases not found in the Mok et al. dataset are given an FDR score of 1.
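The FDR computation in the cell below boils down to a rank lookup in the sorted null distribution; a minimal standalone sketch (empirical_fdr is a hypothetical helper):

```python
import bisect

def empirical_fdr(score, null_scores):
    """Fraction of shuffled (null) KLD scores at least as good as the
    observed one; lower KLD = better match, so count null scores below."""
    null_sorted = sorted(null_scores)
    return bisect.bisect_left(null_sorted, score) / len(null_sorted)
```

A score better (lower) than every shuffled score yields an FDR of 0; a score worse than all of them yields 1.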

Use jupyter command "Run all below" from this point on.

Output:
Shared_interactors_Kinase_FDR.csv -- all yeast kinases and their FDR scores for each module; kinases not found in the Mok et al. dataset are given an FDR score of 1.
Network.sif -- final network file


In [ ]:
# Kullback-Leiber NON-Shuffled data directory
KLDir = 'Kullback-Leibler/'

# Kullback-Leiber shuffled data directory
shufDir = 'KL-shuffle/'

# Dict Kullback-Leibler NON-Shuffled values, key = Module, value = dict key = kinase, value = score
# {'Induced_...RR.S......': {'cka1': 12.263219372654575, ....}
kldata = {}  

# Open Kullback-Leibler directory
for file in glob.glob(KLDir + '*.csv'):
    with open(file, 'r') as f:
        name = re.sub('.csv', '', os.path.basename(file))           # get file name from path
        if name not in kldata:                          
            kldata[name] = {}
        for line in f:                                              # clean up line and split
            line = re.sub(r'\(', '', line)
            line = re.sub(r'\)', '', line)
            line = re.sub(r'\s+', '', line)
            line = re.sub(r'\'', '', line)
            dat = line.rstrip().split(',')
            if dat[1] not in kldata[name]:
                kldata[name][dat[1]] = float(dat[0])

    
# dict Kullback-Leibler shuffled data values, key = Module, value = dict key = kinase, value = list of scores
# All scores are placed in a single list (NOT by Kinase)
klshuf = {}

for file in glob.glob(shufDir + '*.csv'):
    with open(file, 'r') as f:
        name = re.sub('.csv', '', os.path.basename(file))          # get file name from path
        if name not in klshuf:                          
            klshuf[name] = []                                      # if module not in dict add it
        for line in f:                                             # strip tuple decoration and split
            line = re.sub(r'\(', '', line)
            line = re.sub(r'\)', '', line)
            line = re.sub(r'\s+', '', line)
            line = re.sub(r"'", '', line)
            dat = line.rstrip().split(',')
            klshuf[name].append(float(dat[0]))                     # add KL score
    
# Sort scores for module
for k,v in klshuf.items():
    v.sort()     

# Get the total number of shuffled Kullback-Leibler trials for the False Discovery Rate calculation
# use a random file chosen from the shuffled results directory
with open(random.choice(glob.glob(shufDir + '*.csv')),'r') as f:
    for trial, l in enumerate(f, 1):
        pass

# Find the position of the non-shuffled kinase score for each Module and store the value in fdr dictionary
FDR = {}   # dict of dicts, key = Module, value = dict: key = kinase, value = {'fdr', 'numScoresBelow', 'score', 'group'}

# calculate the FDR score for each kinase in a Module
for module, scores in kldata.items():                                # Start w/ non-shuffled data
    for kinase in scores.keys():
        pos = bisect.bisect_left(klshuf[module], scores[kinase] )    # returns the num of scores < original score
        fdr = pos/trial                                              # divide pos by the total num of trials
        if module not in FDR:
            FDR[module] = {}
        FDR[module][kinase] = {
            'fdr': fdr,
            'numScoresBelow': pos,
            'score': scores[kinase],
        }

# add Group (according to Mok) to the FDR dict for each Module
# open and load the required/Kinase_Groups.csv file
mok = {}               # dict to store Mok_Kinase_Group information
mokName = {}           # dict maps standard name to SGD systematic name

with open('required/Kinase_Groups.csv', 'r') as file:
    file.readline()                       # skip header
    for line in file:
        dat = line.rstrip().split()       # note: splits on whitespace, not commas
        if dat[0] not in mok:             # key = kinase(yck3), value =  Acidophillic
            mok[dat[0]] = dat[2]
        if dat[0] not in mokName:         # key = kinase(prr1), value = YKL116C
            mokName[dat[0]] = dat[1]


for k,v in FDR.items():                   # Go through FDR dict and add group of each kinase w/in a Module 
    for kinase,score in v.items():        # key = Induced_...RR.S......, group = 'Acidophillic'  
        score['group'] = mok[kinase]
        

# open Shared_interactors.csv to match kinase names to shared interactor names;
# this includes kinases 'Not_in_Mok' as well, for completeness
with open('Shared_interactors.csv','r') as f, open('Shared_interactors_Kinase_FDR.csv', 'w') as out:
    header = f.readline().rstrip()
    out.write(header + ',Kinase,Module,Group(Mok),NumScoresBelow,FDR,KLD_score\n')
    tmp = {}                              # dict used for sorting
    for line in f:
        si = line.rstrip().split(',')
        grouping = si[0].split('_')
        module = grouping[0] + '_' + grouping[1]
        # if we have a kinase, keep it
        stdName = si[9].lower()
        if stdName in mok:
            if mok[stdName] == 'Not_in_Mok':
                fdr = 0
                outline = ','.join(si) + ',' + stdName + ',' + module + ',Not_in_Mok,0,0,0\n'
            else:
                fdr = FDR[module][stdName]['fdr']
                outline = (','.join(si) + ',' + stdName + ',' + module +
                           ',' + FDR[module][stdName]['group'] +
                           ',' + str(FDR[module][stdName]['numScoresBelow']) +
                           ',' + str(fdr) +
                           ',' + str(FDR[module][stdName]['score']) + '\n')

            if si[0] not in tmp:
                tmp[si[0]] = {}
            if stdName not in tmp[si[0]]:
                tmp[si[0]][stdName] = {'fdr': fdr, 'data': outline}

    # sort each subModule group by FDR and write to file
    for m in tmp.keys():
        fdrOrd = sorted(tmp[m].items(), key=lambda k_v: k_v[1]['fdr'])
        for i in fdrOrd:
            out.write(i[1]['data'])


# load kinase & phosphatase annotation
annotation = {}                                        # key = ORF name, value = annotation( Kinase or Phosphatase)
with open('required/kinase_phosphatase_yeast.csv', 'r') as f:
    f.readline()                                                # skip header
    for line in f:
        dat = line.rstrip().split(',')
        if dat[0] not in annotation:
            annotation[dat[0]] = dat[1]

# Open Submodules.txt, Shared_interactors.csv to create the final Network SIF file

outFile = open('Network.sif', 'w')                              # open output file
header = ['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation']         # header column names
outFile.write('\t'.join(header) + '\n')

with open('Submodules.txt', 'r') as f:
    f.readline()                                               # skip header
    for sub in f:
        line = sub.rstrip().split(',')
        if line[1] in annotation:
            line.append(annotation[line[1]])
        else:
            line.append('\t')
        line.insert(1,'Constituent')
        outFile.write('\t'.join(line))
        outFile.write('\n')

# find the max difference between the ordered FDR values within a submodule
# dict, key = subModule Name, value = { 'fdr': list of fdr scores,  'data': list of original row data }
cutoff = {}   
track  = {}  # dictionary to keep track of Shared_interactors printed which are kinases, so we don't double print

with open('Shared_interactors_Kinase_FDR.csv', 'r') as f:
    f.readline()                                                 # skip header
    for line in f:
        dat = line.rstrip().split(',')
        
        if dat[0] not in cutoff:
            cutoff[dat[0]] = { 'fdr' : [], 'data' : [] }
            
        cutoff[dat[0]]['fdr'].append(dat[-2])                    # get FDR score
        cutoff[dat[0]]['data'].append(line.rstrip())             # get original line from file

maxDif = -10     # maximum difference between steps in fdr list
index  = -1      # index of list, 1 to index will be output in final sif

for k,v in cutoff.items():
    for i in range(len(v['fdr']) - 1):
        dif = math.fabs(float(v['fdr'][i]) - float(v['fdr'][i+1]))      
        #print(i, i+1,v['fdr'][i] ,v['fdr'][i+1], dif) 
        if dif > maxDif:                                         # set index and difference
            maxDif = dif
            index = i + 1
    #print(maxDif, index)
    for x in range(len(v['fdr'])):                               # now set the specificity for kinases
        row = v['data'][x].split(',')
        if row[0] not in track:
            track[row[0]] = set()
        track[row[0]].add(row[1])
        
        if row[13] == 'Not_in_Mok':
            specificity = 'Unknown Specificity Match'
        else:
            if x < index:                                        # if < cutoff (maxDif)
                specificity = 'Motif Match'
            else:
                specificity = 'Unmatched Specificity'
                
        # write tab-delimited to match the header and the Submodules rows
        if row[9] == 'output' or row[9] == 'None':
            outFile.write('%s\t%s\t%s\t%s\n' % (row[0], specificity, row[1], 'kinase'))
        else:
            outFile.write('%s\t%s\t%s\t%s\n' % (row[1], specificity, row[0], 'kinase'))
            
    
    # reset for next subModule
    maxDif = -10
    index = -1

outFile.close()
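The specificity cutoff above is a max-gap rule: within each submodule, kinases ranked before the largest jump between consecutive sorted FDR values are labelled 'Motif Match', the rest 'Unmatched Specificity'. The rule in isolation (the helper name `max_gap_index` is illustrative, not part of the pipeline):

```python
import math

def max_gap_index(fdrs):
    """Index just past the largest absolute difference between
    consecutive FDR values; entries at positions < index are the
    ones labelled 'Motif Match' in the cell above."""
    maxDif, index = -10, -1
    for i in range(len(fdrs) - 1):
        dif = math.fabs(fdrs[i] - fdrs[i + 1])
        if dif > maxDif:
            maxDif, index = dif, i + 1
    return index
```

For example, a submodule with sorted FDRs [0.0, 0.02, 0.6, 0.7] has its largest jump between 0.02 and 0.6, so the first two kinases fall before the cutoff.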

In [ ]:
outFile = open('Network.sif', 'a')
                
# open Shared_interactors.csv and add Shared interactors that are not kinases
with open('Shared_interactors.csv', 'r') as f:
    f.readline()                                               # skip header
    for line in f:
        dat = line.rstrip().split(',')
        # skip pairs already written as kinase edges by the previous cell
        if dat[0] in track and dat[1] in track[dat[0]]:
            continue
        # tab-delimited to match the rest of the SIF file
        if dat[9] == 'output' or dat[9] == 'None':
            outFile.write('%s\t%s\t%s\n' % (dat[0], 'Shared Interactor', dat[1]))
        else:
            outFile.write('%s\t%s\t%s\n' % (dat[1], 'Shared Interactor', dat[0]))
outFile.close()
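Both cells orient each network edge using column 10 of the row: 'output' or 'None' keeps the first two columns in their original order, anything else flips them. The exact meaning of that flag isn't documented in this section, so the sketch below (with a hypothetical helper name `sif_edge`) just captures the branch:

```python
def sif_edge(col1, col2, flag, edge_type='Shared Interactor'):
    """Edge orientation used in the cells above: a flag of 'output'
    or 'None' keeps the row order (col1, edge_type, col2); any
    other value reverses the endpoints."""
    if flag in ('output', 'None'):
        return (col1, edge_type, col2)
    return (col2, edge_type, col1)
```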