Protein Family Classification



In [2]:

    
import numpy as np
import pandas as pd
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.manifold import TSNE



In [3]:

    
family_classification_metadata = pd.read_table('../seminar_5/data/family_classification_metadata.tab')
family_classification_sequences = pd.read_table('../seminar_5/data/family_classification_sequences.tab')



In [4]:

    
family_classification_metadata.head()









    Out[4]:







  
    
      
      SwissProtAccessionID
      LongID
      ProteinName
      FamilyID
      FamilyDescription
    
  
  
    
      0
      Q6GZX4
      001R_FRG3G
      Putative transcription factor 001R
      Pox_VLTF3
      Poxvirus Late Transcription Factor VLTF3 like
    
    
      1
      Q6GZX3
      002L_FRG3G
      Uncharacterized protein 002L
      DUF230
      Poxvirus proteins of unknown function
    
    
      2
      Q6GZX0
      005R_FRG3G
      Uncharacterized protein 005R
      US22
      US22 like
    
    
      3
      Q91G88
      006L_IIV6
      Putative KilA-N domain-containing protein 006L
      DUF3627
      Protein of unknown function (DUF3627)
    
    
      4
      Q197F3
      007R_IIV3
      Uncharacterized protein 007R
      DUF2738
      Protein of unknown function (DUF2738)



In [5]:

    
family_classification_sequences.head()









    Out[5]:







  
    
      
      Sequences
    
  
  
    
      0
      MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQV...
    
    
      1
      MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQT...
    
    
      2
      MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFV...
    
    
      3
      MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGK...
    
    
      4
      MEAKNITIDNTTYNFFKFYNINQPLTNLKYLNSERLCFSNAVMGKI...

Task:

Use your ProtVec embedding from homework 5 to perform protein family classification using RNN.

use 1000 most frequent families for classification
validate your results on the train-test split
reduce the dimensionality of the protein-space using Stochastic Neighbor Embedding and visualize two most frequent classes
compare your RNN results with SVM
visualization and metrics are up to you

Let's read the embedding matrix from the original article data.



In [6]:

    
table = pd.read_csv('data/protVec_100d_3grams_without_quotes.csv', sep='\t', header=None)
table = table.T
header = table.iloc[0] # grab the first row for the header
prot2vec = table[1:] # take the data less the header row
prot2vec.columns = header # set the header row as the df header
prot2vec["AAA"].head()









    Out[6]:





1    -0.17406
2   -0.095756
3    0.059515
4    0.039673
5   -0.375934
Name: AAA, dtype: object

2 most frequent families:



In [7]:

    
most_common_families = Counter(family_classification_metadata['FamilyID']).most_common(2)
most_common_families = [family for (family, count) in most_common_families]
family2num = {f: i for (i, f) in enumerate(most_common_families)}
family2num









    Out[7]:





{'Helicase_C': 1, 'MMR_HSR1': 0}



In [8]:

    
MAX_PROTEIN_LEN = 501
EMBED_LEN = 100



In [9]:

    
all_proteins = family_classification_sequences['Sequences']
all_families = family_classification_metadata['FamilyID']

selected_ids = [i for i in range(len(all_proteins)) 
                  if all_families[i] in family2num and len(all_proteins[i]) <= MAX_PROTEIN_LEN]



In [13]:

    
def embedding(protein):
    res = np.zeros(100)
    for i in range(0, (len(protein) - 3) // 3):
        try:
            res += prot2vec[protein[i*3: i*3 + 3]]
        except KeyError:
            res += prot2vec['<unk>']

    return res / ((len(protein) - 3) // 3)

#embedding(all_proteins[11])



In [14]:

    
selected_proteins = [embedding(p) for p in all_proteins[selected_ids]]



In [15]:

    
tsne = TSNE(n_components=2, random_state=42, angle=0.7, init='pca', n_iter=500)
XX = tsne.fit_transform(selected_proteins)



In [16]:

    
tsne_df = pd.DataFrame(XX, columns=['x0', 'x1'])



In [17]:

    
plt.figure(figsize=(10, 10))
colors = ['red', 'blue']
plt.scatter(tsne_df['x0'], tsne_df['x1'], c=[colors[family2num[f]] for f in all_families[selected_ids]], s=20);



In [ ]:

	SwissProtAccessionID	LongID	ProteinName	FamilyID	FamilyDescription
0	Q6GZX4	001R_FRG3G	Putative transcription factor 001R	Pox_VLTF3	Poxvirus Late Transcription Factor VLTF3 like
1	Q6GZX3	002L_FRG3G	Uncharacterized protein 002L	DUF230	Poxvirus proteins of unknown function
2	Q6GZX0	005R_FRG3G	Uncharacterized protein 005R	US22	US22 like
3	Q91G88	006L_IIV6	Putative KilA-N domain-containing protein 006L	DUF3627	Protein of unknown function (DUF3627)
4	Q197F3	007R_IIV3	Uncharacterized protein 007R	DUF2738	Protein of unknown function (DUF2738)

	Sequences
0	MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQV...
1	MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQT...
2	MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFV...
3	MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGK...
4	MEAKNITIDNTTYNFFKFYNINQPLTNLKYLNSERLCFSNAVMGKI...