Protein Family Classification


In [1]:
import pandas as pd

In [3]:
family_classification_metadata = pd.read_table('../seminar_5/data/family_classification_metadata.tab')
family_classification_sequences = pd.read_table('../seminar_5/data/family_classification_sequences.tab')

In [4]:
family_classification_metadata.head()


Out[4]:
   SwissProtAccessionID  LongID      ProteinName                                     FamilyID   FamilyDescription
0  Q6GZX4                001R_FRG3G  Putative transcription factor 001R              Pox_VLTF3  Poxvirus Late Transcription Factor VLTF3 like
1  Q6GZX3                002L_FRG3G  Uncharacterized protein 002L                    DUF230     Poxvirus proteins of unknown function
2  Q6GZX0                005R_FRG3G  Uncharacterized protein 005R                    US22       US22 like
3  Q91G88                006L_IIV6   Putative KilA-N domain-containing protein 006L  DUF3627    Protein of unknown function (DUF3627)
4  Q197F3                007R_IIV3   Uncharacterized protein 007R                    DUF2738    Protein of unknown function (DUF2738)

In [5]:
family_classification_sequences.head()


Out[5]:
Sequences
0 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQV...
1 MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQT...
2 MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFV...
3 MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGK...
4 MEAKNITIDNTTYNFFKFYNINQPLTNLKYLNSERLCFSNAVMGKI...
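
Each row is a single amino-acid string. The ProtVec article linked in the task below represents a sequence as three lists of non-overlapping 3-grams, one per starting offset (0, 1, 2). A minimal sketch of that split follows; the function name `split_ngrams` and the short example string are assumptions for illustration, and your homework 5 implementation may differ.

```python
def split_ngrams(sequence, n=3):
    """Split a sequence into n lists of non-overlapping n-grams, one per offset."""
    frames = []
    for offset in range(n):
        # Step through the sequence in strides of n, starting at this offset;
        # a trailing fragment shorter than n is dropped.
        frame = [
            sequence[i:i + n]
            for i in range(offset, len(sequence) - n + 1, n)
        ]
        frames.append(frame)
    return frames


# First ten residues of the first sequence shown above.
frames = split_ngrams("MAFSAEDVLK")
```

Each inner list can then be mapped to embedding vectors and fed to the RNN as a sequence of tokens.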

In [7]:
family_classification_metadata.describe()


Out[7]:
        SwissProtAccessionID  LongID      ProteinName              FamilyID  FamilyDescription
count   324018                324018      324018                   324018    324018
unique  287308                295671      56951                    7027      6967
top     Q1X881                POLG_DEN3I  UvrABC system protein B  MMR_HSR1  50S ribosome-binding GTPase
freq    16                    12          1500                     3084      3084
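
The summary above reports 7027 distinct `FamilyID` values, with the largest family (MMR_HSR1) covering 3084 rows, so the classes are far from balanced. A quick way to see how many rows the most frequent families account for is sketched below; the helper name `top_k_coverage` and the toy series are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def top_k_coverage(family_ids, k):
    """Return the fraction of rows that belong to the k most frequent families."""
    counts = family_ids.value_counts()  # sorted from most to least frequent
    return counts.head(k).sum() / counts.sum()


# Toy stand-in for family_classification_metadata["FamilyID"].
toy_families = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"])
coverage = top_k_coverage(toy_families, k=2)
```

On the real data, `top_k_coverage(family_classification_metadata["FamilyID"], 1000)` would show how much of the dataset remains after restricting it to 1000 families.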

Task:

Use your ProtVec embedding from homework 5 to classify protein families with an RNN.

The original research article is available at http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0141287&type=printable

  • Use the 1000 most frequent families for classification.
  • Validate your results on a train-test split.
  • Reduce the dimensionality of the protein space with Stochastic Neighbor Embedding and visualize the two most frequent classes.
  • Compare your RNN results with an SVM.
  • The choice of visualizations and metrics is up to you.
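
The first two items can be sketched with pandas alone. This is a minimal sketch, not a reference solution: the helper names `keep_top_families` and `stratified_split`, the 80/20 ratio, and the toy tables are assumptions for illustration. It also assumes that the metadata and sequence tables are aligned row by row, which the two `head()` outputs suggest but do not prove.

```python
import pandas as pd


def keep_top_families(metadata, sequences, k=1000):
    """Return sequences and labels for the k most frequent families.

    Assumes both tables have the same length and the same row order.
    """
    if len(metadata) != len(sequences):
        raise ValueError("metadata and sequences must have the same number of rows")
    # value_counts() sorts families from most to least frequent.
    top = metadata["FamilyID"].value_counts().head(k).index
    mask = metadata["FamilyID"].isin(top).to_numpy()
    return pd.DataFrame({
        "sequence": sequences["Sequences"].to_numpy()[mask],
        "family": metadata["FamilyID"].to_numpy()[mask],
    })


def stratified_split(data, test_frac=0.2, seed=0):
    """Hold out test_frac of each family (rounded per family) as the test set."""
    test = data.groupby("family").sample(frac=test_frac, random_state=seed)
    train = data.drop(test.index)
    return train, test


# Toy stand-ins for the two tables loaded above.
toy_metadata = pd.DataFrame({"FamilyID": ["A"] * 10 + ["B"] * 5 + ["C"]})
toy_sequences = pd.DataFrame({"Sequences": ["MAFSAEDVLK"] * 16})

data = keep_top_families(toy_metadata, toy_sequences, k=2)
train, test = stratified_split(data)
```

Splitting per family keeps the class proportions similar in both parts, which matters here because family sizes vary widely. Very small families may end up with no test rows under this rounding, so check the per-family counts before training.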