Fire up GraphLab Create


In [1]:
import graphlab
import pandas
from Bio import SeqIO


A newer version of GraphLab Create (v1.7.1) is available! Your current version is v1.6.1.

You can use pip to upgrade the graphlab-create package. For more information see https://dato.com/products/create/upgrade.

Load a tabular data set


In [2]:
sf=graphlab.SFrame('people-example.csv')
table1=graphlab.SFrame('table1.tsv')
hand_nt=open('1648.160_nt.fna', 'r')
hand_aa=open('1648.160.faa', 'r')

table1_nt=hand_nt.read()
table1_aa=hand_aa.read()

hand_nt.close()
hand_aa.close()


[INFO] This non-commercial license of GraphLab Create is assigned to andriy.sheremet@ucalgary.ca and will expire on October 26, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-11240 - Server binary: C:\Anaconda\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\Andriy\AppData\Local\Temp\graphlab_server_1448140378.log.0
[INFO] GraphLab Server Version: 1.6.1
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\people-example.csv
PROGRESS: Parsing completed. Parsed 7 lines in 0.03125 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\people-example.csv
PROGRESS: Parsing completed. Parsed 7 lines in 0.015637 secs.
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\table1.tsv
PROGRESS: Parsing completed. Parsed 100 lines in 0.015624 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,long,long,long,str,long,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\table1.tsv
PROGRESS: Parsing completed. Parsed 1575 lines in 0.015625 secs.

Defining user functions


In [12]:
def returnLen(text):
    return len(text)

def gc(dna):
    "This function computes the GC ratio of a dna sequence"
    nbases=dna.count('n')+dna.count('N')
    gc_cont=float(dna.count('c')+dna.count('C')+dna.count('g')+dna.count('G'))/(len(dna)-nbases)
    return gc_cont

Processing FastA files


In [30]:
sf #we can view first few lines of the table
#table1_aa
#table1_nt
lst_aa=table1_aa.split('>')
lst_nt=table1_nt.split('>')
lst_nt=lst_nt[1:]#first (No. 0) list item is empty
#dict_nt
lst_aa=lst_aa[1:]# the same for amino acids

Converting gbk-file


In [3]:
SeqIO.convert("1648.160.gbk", "genbank", "1648.160_To_FastA.fna", "fasta") #output is not so friendly


Out[3]:
723

In [7]:
#hand_gb=open('1648.160.gbk','r')
#out_file = "1648.160_To_FastA_parced.fna"
#hand_out = open(out_file, "w")

#for seq_record in SeqIO.parse(hand_gb, "genbank") :
#    print "Dealing with GenBank record %s" % seq_record.id
#    hand_out.write(">%s %s\n%s\n" % (
#           seq_record.id,
#           seq_record.description,
#           seq_record.seq))

#hand_out.close()
#hand_gb.close()

This seems even better that all I've done before: there's one file containing all the data

tab separated text format


In [2]:
Banks969_Li_SFrame=graphlab.SFrame('6666666.156894.tsv')
Banks969_Li_SFrame
Banks969_Li_SFrame.show()


[INFO] This non-commercial license of GraphLab Create is assigned to andriy.sheremet@ucalgary.ca and will expire on October 26, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-10556 - Server binary: C:\Anaconda\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\Andriy\AppData\Local\Temp\graphlab_server_1448764274.log.0
[INFO] GraphLab Server Version: 1.6.1
PROGRESS: Finished parsing file D:\library\lessons\CMMB 549 Microbial Genetics\plasmidAnnot\plasmid_2015\6666666.156894.tsv
PROGRESS: Parsing completed. Parsed 100 lines in 0.02802 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,str,long,long,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file D:\library\lessons\CMMB 549 Microbial Genetics\plasmidAnnot\plasmid_2015\6666666.156894.tsv
PROGRESS: Parsing completed. Parsed 576 lines in 0.017013 secs.
Canvas is accessible via web browser at the URL: http://localhost:64154/index.html
Opening Canvas in default web browser.

Adding new columns: Nt_length, GC content, Aa_length

!Attention: if used this way, RNA data, specifically, GC content will be included in the calculations filter it out if it isn't needed, or use data obtained from convertion of fastA files


In [23]:
Banks969_Li_SFrame["Calculated_length_nt"]=Banks969_Li_SFrame['nucleotide_sequence'].apply(returnLen)
Banks969_Li_SFrame["Gene_GC"]=Banks969_Li_SFrame['nucleotide_sequence'].apply(gc)
Banks969_Li_SFrame["Calculated_length_aa"]=Banks969_Li_SFrame['aa_sequence'].apply(returnLen)
Banks969_Li_SFrame


Out[23]:
contig_id feature_id type location start stop strand
NODE_100_length_4674_cov_
7.14051_ID_199 ...
fig|1648.160.peg.1 peg NODE_100_length_4674_cov_
7.14051_ID_199_39_530 ...
39 530 +
NODE_100_length_4674_cov_
7.14051_ID_199 ...
fig|1648.160.peg.2 peg NODE_100_length_4674_cov_
7.14051_ID_199_693_1601 ...
693 1601 +
NODE_100_length_4674_cov_
7.14051_ID_199 ...
fig|1648.160.peg.3 peg NODE_100_length_4674_cov_
7.14051_ID_199_1594_2829 ...
1594 2829 +
NODE_100_length_4674_cov_
7.14051_ID_199 ...
fig|1648.160.peg.4 peg NODE_100_length_4674_cov_
7.14051_ID_199_3666_2881 ...
3666 2881 -
NODE_100_length_4674_cov_
7.14051_ID_199 ...
fig|1648.160.peg.5 peg NODE_100_length_4674_cov_
7.14051_ID_199_4297_3758 ...
4297 3758 -
NODE_101_length_4672_cov_
9.29831_ID_201 ...
fig|1648.160.peg.6 peg NODE_101_length_4672_cov_
9.29831_ID_201_4524_3769 ...
4524 3769 -
NODE_102_length_4655_cov_
7.745_ID_203 ...
fig|1648.160.peg.7 peg NODE_102_length_4655_cov_
7.745_ID_203_1744_830 ...
1744 830 -
NODE_102_length_4655_cov_
7.745_ID_203 ...
fig|1648.160.peg.8 peg NODE_102_length_4655_cov_
7.745_ID_203_2906_1728 ...
2906 1728 -
NODE_102_length_4655_cov_
7.745_ID_203 ...
fig|1648.160.peg.9 peg NODE_102_length_4655_cov_
7.745_ID_203_4549_2894 ...
4549 2894 -
NODE_103_length_4645_cov_
7.86383_ID_205 ...
fig|1648.160.peg.10 peg NODE_103_length_4645_cov_
7.86383_ID_205_674_1138 ...
674 1138 +
function aliases figfam evidence_codes nucleotide_sequence
Putative hemagglutinin
/hemolysin-related ...
atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
ABC transporter, ATP-
binding protein ...
FIG00744535 ff atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
ABC transporter, permease
protein (putative) ...
atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
FIG01308195 ff atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
hypothetical protein atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
COG1242: Predicted Fe-S
oxidoreductase ...
FIG00001867 isu;An_Fe-S_oxireductase_
coupled_with_a_methyl ...
atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
Site-specific recombinase
XerD ...
atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
DNA polymerase IV (EC
2.7.7.7) ...
FIG00023943 icw(1);DNA_repair,_bacter
ial ...
atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
DNA repair protein RecN FIG00000317 isu;DNA-replication
isu;DNA_repair,_bacte ...
atgcttactcatttaacgatagata
attttgttttgattcatcata ...
Phage protein atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
aa_sequence Calculated_length_nt Gene_GC Calculated_length_aa
MTVNEDVTLYAQWKKLIVVVDTYTV
VYDGNGNTSGNAPIDSHPYEK ...
492 0.355691056911 163
MKLEFNQINKFFDEHHVLKDISFEV
NSGQIFGYLGRNGAGKTTSIR ...
909 0.324532453245 302
MNKLKTVFKFEFLEMLRKRSVKVTT
LILCIAVLLITSVPTIQSVFM ...
1236 0.330906148867 411
MKKLKLALIGPGFLNDIVAQAWVDG
YLPEYELVGVLGRNPIRTAAF ...
786 0.40203562341 261
MENHVIIYQTKEPKLHTLDACIQDF
TKRYKLDNTWYLYKTANGKPI ...
540 0.322222222222 179
MILKNPDLDKQIEHSENIMLQKWPN
AAKIAYFQAYSNTHDSLKNLK ...
756 0.382275132275 251
MKNKANDFIHYLQFIDPKAALTVAS
YRNDLNQYIEYLEHESVSSLE ...
915 0.343169398907 304
MAQVIFHIDINAFYASAHLITDSSL
YGKPVVVCSNQRGSVVTTASY ...
1179 0.367260390161 392
MLTHLTIDNFVLIHHISVDFSDRFN
VFTGETGAGKSLLVDALNFVS ...
1656 0.362922705314 551
MSNKRGKYVKWLEPKNLLLLESWAR
KISNEQIAANIGIAPKTLYAW ...
465 0.406451612903 154
[1575 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Separating RNA records


In [25]:
rna_Banks969_Li_SFrame=Banks969_Li_SFrame.filter_by('rna','type')
rna_Banks969_Li_SFrame


Out[25]:
contig_id feature_id type location start stop strand
NODE_11_length_9738_cov_8
.40958_ID_21 ...
fig|1648.160.rna.1 rna NODE_11_length_9738_cov_8
.40958_ID_21_9726_9656 ...
9726 9656 -
NODE_155_length_3642_cov_
7.81071_ID_309 ...
fig|1648.160.rna.2 rna NODE_155_length_3642_cov_
7.81071_ID_309_3325_3395 ...
3325 3395 +
NODE_160_length_3510_cov_
8.02605_ID_319 ...
fig|1648.160.rna.3 rna NODE_160_length_3510_cov_
8.02605_ID_319_2002_1930 ...
2002 1930 -
NODE_183_length_3178_cov_
10.617_ID_365 ...
fig|1648.160.rna.4 rna NODE_183_length_3178_cov_
10.617_ID_365_2494_2422 ...
2494 2422 -
NODE_183_length_3178_cov_
10.617_ID_365 ...
fig|1648.160.rna.5 rna NODE_183_length_3178_cov_
10.617_ID_365_2577_2505 ...
2577 2505 -
NODE_183_length_3178_cov_
10.617_ID_365 ...
fig|1648.160.rna.6 rna NODE_183_length_3178_cov_
10.617_ID_365_2827_2755 ...
2827 2755 -
NODE_183_length_3178_cov_
10.617_ID_365 ...
fig|1648.160.rna.7 rna NODE_183_length_3178_cov_
10.617_ID_365_2910_2838 ...
2910 2838 -
NODE_184_length_3174_cov_
6.24174_ID_367 ...
fig|1648.160.rna.8 rna NODE_184_length_3174_cov_
6.24174_ID_367_3041_2969 ...
3041 2969 -
NODE_197_length_2970_cov_
11.1692_ID_393 ...
fig|1648.160.rna.9 rna NODE_197_length_2970_cov_
11.1692_ID_393_1043_971 ...
1043 971 -
NODE_197_length_2970_cov_
11.1692_ID_393 ...
fig|1648.160.rna.10 rna NODE_197_length_2970_cov_
11.1692_ID_393_1143_1070 ...
1143 1070 -
function aliases figfam evidence_codes nucleotide_sequence aa_sequence Calculated_length_nt
tRNA-Thr-GGT gccgccatagttcaatggtaaaata
gagcaatggtaatgctcagtt ...
71
tRNA-Trp-CCA aggggtatagttcaatggtagagca
acggtctccaaaaccgttaat ...
71
tRNA-Thr-CGT gccggcttagctcaattggtagagc
aactgaatcgtaatcagtagg ...
73
tRNA-Glu-TTC ggcccgttggagaaacggttaactc
acatgcctttcacgcatgcat ...
73
tRNA-Asn-GTT gcttgtttagctcagtcggtagagc
aactggctgttaaccagtggg ...
73
tRNA-Glu-TTC ggcccgttggagaaacggttaactc
acatgcctttcacgcatgcat ...
73
tRNA-Asn-GTT gcttgtttagctcagtcggtagagc
aactggctgttaaccagtggg ...
73
tRNA-Lys-CTT gctccgttagctcagtcggtagagc
aactgactcttaatcagtggg ...
73
tRNA-Ala-TGC ggggttttagctcagctgggagagc
gcctgccttgcacgcaggagg ...
73
tRNA-Ile-GAT gggcctgtagctcagttggttagag
cgcacccctgataagggtgag ...
74
Gene_GC Calculated_length_aa
0.492957746479 0
0.507042253521 0
0.520547945205 0
0.547945205479 0
0.547945205479 0
0.547945205479 0
0.547945205479 0
0.575342465753 0
0.602739726027 0
0.581081081081 0
[55 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Separating coding records


In [26]:
coding_Banks969_Li_SFrame=Banks969_Li_SFrame.filter_by('peg','type')
coding_Banks969_Li_SFrame.show()


Canvas is accessible via web browser at the URL: http://localhost:64100/index.html
Opening Canvas in default web browser.

Exporting Data


In [30]:
gene_Banks969_Li_gc_SFrame=coding_Banks969_Li_SFrame.select_columns(['feature_id','Calculated_length_nt','Gene_GC'])
gene_Banks969_Li_gc_SFrame


Out[30]:
feature_id Calculated_length_nt Gene_GC
fig|1648.160.peg.1 492 0.355691056911
fig|1648.160.peg.2 909 0.324532453245
fig|1648.160.peg.3 1236 0.330906148867
fig|1648.160.peg.4 786 0.40203562341
fig|1648.160.peg.5 540 0.322222222222
fig|1648.160.peg.6 756 0.382275132275
fig|1648.160.peg.7 915 0.343169398907
fig|1648.160.peg.8 1179 0.367260390161
fig|1648.160.peg.9 1656 0.362922705314
fig|1648.160.peg.10 465 0.406451612903
[1520 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exporting to csv


In [31]:
gene_Banks969_Li_gc_SFrame.export_csv('gene_Banks969_Li_gc.tsv','\t')

Exporting to dictionary/fastA file


In [33]:
gene_Banks969_Li_gc_SFrame[0]


Out[33]:
{'Calculated_length_nt': 492L,
 'Gene_GC': 0.3556910569105691,
 'feature_id': 'fig|1648.160.peg.1'}

Forming a record of the following pattern: ">ID| function| [Strain]


In [56]:
coding_Banks969_Li_aa_dict={}
for item in coding_Banks969_Li_SFrame:
    coding_Banks969_Li_aa_dict['>'+item['feature_id']+'| '+item['function']+'| '+'[Erysipelothrix rhusiopathiae Banks969-Li]']=item['aa_sequence']
    #print item['function']

for i in range(len(coding_Banks969_Li_aa_dict)):
    print coding_Banks969_Li_aa_dict[][i]


  File "<ipython-input-56-aa03a4ea185d>", line 7
    print coding_Banks969_Li_aa_dict[][i]
                                     ^
SyntaxError: invalid syntax

Writing to a faa file


In [60]:
out_file='Banks969_Li_annotated.faa'
hand_out=open(out_file,'w')

for item in coding_Banks969_Li_aa_dict:
   hand_out.write(item+'\n'+coding_Banks969_Li_aa_dict[item])
hand_out.close()
print len(coding_Banks969_Li_aa_dict), type(coding_Banks969_Li_aa_dict)


1520 <type 'dict'>

In [61]:
out_file='Banks969_Li_annotated_1.faa'
hand_out=open(out_file,'w')
coding_Banks969_Li_aa_headers=coding_Banks969_Li_aa_dict.keys()
coding_Banks969_Li_aa_sequences=coding_Banks969_Li_aa_dict.values()
for i in range(len(coding_Banks969_Li_aa_dict)):
    hand_out.write(coding_Banks969_Li_aa_headers[i]+'\n'+coding_Banks969_Li_aa_sequences[i])
hand_out.close()
print len(coding_Banks969_Li_aa_dict), type(coding_Banks969_Li_aa_dict)
print len(coding_Banks969_Li_aa_headers), type(coding_Banks969_Li_aa_headers)
print len(coding_Banks969_Li_aa_sequences), type(coding_Banks969_Li_aa_sequences)


1520 <type 'dict'>
1520 <type 'list'>
1520 <type 'list'>

Correcting for the newline characters between the entries but not after the last line


In [65]:
out_file='Banks969_Li_annotated_optimized.faa'
hand_out=open(out_file,'w')
coding_Banks969_Li_aa_headers=coding_Banks969_Li_aa_dict.keys()
coding_Banks969_Li_aa_sequences=coding_Banks969_Li_aa_dict.values()
for i in range(len(coding_Banks969_Li_aa_dict)):
    if (i<(len(coding_Banks969_Li_aa_dict)-1)):
        hand_out.write(coding_Banks969_Li_aa_headers[i]+'\n'+coding_Banks969_Li_aa_sequences[i]+'\n')
    else:
        hand_out.write(coding_Banks969_Li_aa_headers[i]+'\n'+coding_Banks969_Li_aa_sequences[i])
hand_out.close()
print len(coding_Banks969_Li_aa_dict), type(coding_Banks969_Li_aa_dict)
print len(coding_Banks969_Li_aa_headers), type(coding_Banks969_Li_aa_headers)
print len(coding_Banks969_Li_aa_sequences), type(coding_Banks969_Li_aa_sequences)


1520 <type 'dict'>
1520 <type 'list'>
1520 <type 'list'>

In [ ]:

Create a dictionary containing the data


In [31]:
dict_nt={}

for item in lst_nt:
    lst_item=item.split('\n')
    dict_nt[lst_item[0].strip()]=''.join(lst_item[1:]).strip()
#dict_nt ##looks ok but too large to display

In [33]:
dict_aa={}
for item in lst_aa:
    lst_item=item.split('\n')
    dict_aa[lst_item[0].strip()]=''.join(lst_item[1:]).strip()
#dict_aa ###too large to display

Creating index for table1 to insert nucleotides: indexed_df = df.set_index(['A', 'B'])


In [13]:
#table1_addedColumn=table1
#table1_addedColumn['Seq_nt']=table1_addedColumn['NCBI GI']#just creating an empty column
#indexed_table1=table1.to_dataframe()#converting to Dataframe and indexing
#indexed_table1.set_index('Feature ID')

In [48]:
sf=graphlab.SFrame({'Feature ID':dict_nt.keys(),'Seq_nt':dict_nt.values()})
#sf['Feature ID']
aa_SFrame=graphlab.SFrame({'Feature ID':dict_aa.keys(),'Seq_aa':dict_aa.values()})
aa_SFrame


Out[48]:
Feature ID Seq_aa
fig|1648.160.peg.1398 MKLTDKQLKIIETYGADTPESFLMV
FPYRYERLEAKPRDAWRPGDT ...
fig|1648.160.peg.1399 MTKWDLLDSMKIPYFNKDLINNAFI
HSSYVNEAEELLEDNERLEFM ...
fig|1648.160.peg.1428 MVIHFDLAQKEAKIKELESEMVEDG
FWDDHRHSSQHIQKLNQLKKV ...
fig|1648.160.peg.1429 MAGFLSNLFSGDRKILNEIEKIAHE
IDALADETRALSDEALKEKTN ...
fig|1648.160.peg.1394 MPKLDILNLEGKSLRELELNEAVFG
IEPNNQTIFEAVVMQQASLRQ ...
fig|1648.160.peg.1395 MKGLLGRKLGMTQVFTTDGKLIPVS
VVEVLPNVVLQKKTMESDNYE ...
fig|1648.160.peg.1396 MAKNFIRIRLKAYEHRTIDSAAQKI
VQAANDHGAQKVVGPVPLPTE ...
fig|1648.160.peg.1397 MAAKQAQDICEDKDIHVIETKTIPQ
GLSACVMFNPDVDVEDNLTEM ...
fig|1648.160.peg.1422 MVPYESTVFTNYQTIARLGLLNTYT
ALIVPSLASVFYIFYLKGYLT ...
fig|1648.160.peg.1391 MSRSLKKGPFCDLHLMNKVEKLNAE
SKKEVIKTWSRRSTIFPQFVE ...
[1520 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Adding a column and inserting nucleotide data


In [49]:
table1_merge=table1
#table1_merge['Feature ID']

In [50]:
sf


Out[50]:
Feature ID Seq_nt
fig|1648.160.peg.1398 atgaaactcacggataaacaattaa
aaattattgaaacttatggtg ...
fig|1648.160.peg.1399 atgacaaaatgggatttattggatt
ctatgaagattccatacttta ...
fig|1648.160.peg.1428 ttggtgattcactttgacttagctc
aaaaagaagctaaaatcaaag ...
fig|1648.160.peg.1429 atggctggatttttaagtaatcttt
ttagtggcgatcgtaagattt ...
fig|1648.160.peg.1394 atgcctaagttagatattttaaacc
ttgaaggcaagagtcttcgtg ...
fig|1648.160.peg.1395 atgaaaggattactaggacgtaaat
taggaatgacacaagtcttca ...
fig|1648.160.peg.1396 atggcaaagaattttattcgtattc
gtttaaaagcttatgaacacc ...
fig|1648.160.peg.1397 atggctgcaaaacaagctcaagata
tttgtgaagacaaagatatac ...
fig|1648.160.peg.1390 atggaagtaaaagctatagttaaaa
cagttcgtgttactccacgta ...
fig|1648.160.peg.1391 atgagtcgtagtttaaaaaaaggac
cattttgtgatcttcatttaa ...
[1520 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Trying to merge


In [51]:
mergedTbl1=table1.join(sf, how='left')
mergedTbl1.show()


Canvas is updated and available in a tab in the default browser.

Defining user functions


In [52]:


In [53]:
mergedTbl1["Calculated_length_nt"]=mergedTbl1['Seq_nt'].apply(returnLen)
mergedTbl1


Out[53]:
Feature ID Type Contig Start Stop Frame Strand Length (bp)
fig|1648.160.peg.1 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
39 530 3 + 492
fig|1648.160.peg.2 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
693 1601 3 + 909
fig|1648.160.peg.3 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
1594 2829 1 + 1236
fig|1648.160.peg.4 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
3666 2881 -3 - 786
fig|1648.160.peg.5 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
4297 3758 -1 - 540
fig|1648.160.peg.6 CDS NODE_101_length_4672_cov_
9.29831_ID_201 ...
4524 3769 -3 - 756
fig|1648.160.peg.7 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
1744 830 -1 - 915
fig|1648.160.peg.8 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
2906 1728 -2 - 1179
fig|1648.160.peg.9 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
4549 2894 -1 - 1656
fig|1648.160.peg.10 CDS NODE_103_length_4645_cov_
7.86383_ID_205 ...
674 1138 2 + 465
Function Subsystem NCBI GI locus Seq_nt Calculated_length_nt
Putative hemagglutinin
/hemolysin-related ...
- none - atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
492
ABC transporter, ATP-
binding protein ...
- none - atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
909
ABC transporter, permease
protein (putative) ...
- none - atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
1236
L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
- none - atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
786
hypothetical protein - none - atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
540
COG1242: Predicted Fe-S
oxidoreductase ...
An Fe-S oxireductase
coupled with a ...
atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
756
Site-specific recombinase
XerD ...
- none - atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
915
DNA polymerase IV (EC
2.7.7.7) ...
DNA repair, bacterial atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
1179
DNA repair protein RecN DNA-replication; <br>DNA
repair, bacterial ...
atgcttactcatttaacgatagata
attttgttttgattcatcata ...
1656
Phage protein - none - atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
465
[1575 rows x 14 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [ ]:

Inserting GC content data for every gene


In [54]:
mergedTbl1["Gene_GC"]=mergedTbl1['Seq_nt'].apply(gc)
mergedTbl1


Out[54]:
Feature ID Type Contig Start Stop Frame Strand Length (bp)
fig|1648.160.peg.1 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
39 530 3 + 492
fig|1648.160.peg.2 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
693 1601 3 + 909
fig|1648.160.peg.3 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
1594 2829 1 + 1236
fig|1648.160.peg.4 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
3666 2881 -3 - 786
fig|1648.160.peg.5 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
4297 3758 -1 - 540
fig|1648.160.peg.6 CDS NODE_101_length_4672_cov_
9.29831_ID_201 ...
4524 3769 -3 - 756
fig|1648.160.peg.7 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
1744 830 -1 - 915
fig|1648.160.peg.8 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
2906 1728 -2 - 1179
fig|1648.160.peg.9 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
4549 2894 -1 - 1656
fig|1648.160.peg.10 CDS NODE_103_length_4645_cov_
7.86383_ID_205 ...
674 1138 2 + 465
Function Subsystem NCBI GI locus Seq_nt Calculated_length_nt
Putative hemagglutinin
/hemolysin-related ...
- none - atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
492
ABC transporter, ATP-
binding protein ...
- none - atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
909
ABC transporter, permease
protein (putative) ...
- none - atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
1236
L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
- none - atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
786
hypothetical protein - none - atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
540
COG1242: Predicted Fe-S
oxidoreductase ...
An Fe-S oxireductase
coupled with a ...
atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
756
Site-specific recombinase
XerD ...
- none - atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
915
DNA polymerase IV (EC
2.7.7.7) ...
DNA repair, bacterial atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
1179
DNA repair protein RecN DNA-replication; <br>DNA
repair, bacterial ...
atgcttactcatttaacgatagata
attttgttttgattcatcata ...
1656
Phage protein - none - atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
465
Gene_GC
0.355691056911
0.324532453245
0.330906148867
0.40203562341
0.322222222222
0.382275132275
0.343169398907
0.367260390161
0.362922705314
0.406451612903
[1575 rows x 15 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Adding Amino Acid records

Merging two SFrames


In [55]:
mergedTbl1=mergedTbl1.join(aa_SFrame, how='left')
mergedTbl1 #OK


Out[55]:
Feature ID Type Contig Start Stop Frame Strand Length (bp)
fig|1648.160.peg.1 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
39 530 3 + 492
fig|1648.160.peg.2 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
693 1601 3 + 909
fig|1648.160.peg.3 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
1594 2829 1 + 1236
fig|1648.160.peg.4 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
3666 2881 -3 - 786
fig|1648.160.peg.5 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
4297 3758 -1 - 540
fig|1648.160.peg.6 CDS NODE_101_length_4672_cov_
9.29831_ID_201 ...
4524 3769 -3 - 756
fig|1648.160.peg.7 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
1744 830 -1 - 915
fig|1648.160.peg.8 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
2906 1728 -2 - 1179
fig|1648.160.peg.9 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
4549 2894 -1 - 1656
fig|1648.160.peg.10 CDS NODE_103_length_4645_cov_
7.86383_ID_205 ...
674 1138 2 + 465
Function Subsystem NCBI GI locus Seq_nt Calculated_length_nt
Putative hemagglutinin
/hemolysin-related ...
- none - atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
492
ABC transporter, ATP-
binding protein ...
- none - atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
909
ABC transporter, permease
protein (putative) ...
- none - atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
1236
L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
- none - atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
786
hypothetical protein - none - atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
540
COG1242: Predicted Fe-S
oxidoreductase ...
An Fe-S oxireductase
coupled with a ...
atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
756
Site-specific recombinase
XerD ...
- none - atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
915
DNA polymerase IV (EC
2.7.7.7) ...
DNA repair, bacterial atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
1179
DNA repair protein RecN DNA-replication; <br>DNA
repair, bacterial ...
atgcttactcatttaacgatagata
attttgttttgattcatcata ...
1656
Phage protein - none - atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
465
Gene_GC Seq_aa
0.355691056911 MTVNEDVTLYAQWKKLIVVVDTYTV
VYDGNGNTSGNAPIDSHPYEK ...
0.324532453245 MKLEFNQINKFFDEHHVLKDISFEV
NSGQIFGYLGRNGAGKTTSIR ...
0.330906148867 MNKLKTVFKFEFLEMLRKRSVKVTT
LILCIAVLLITSVPTIQSVFM ...
0.40203562341 MKKLKLALIGPGFLNDIVAQAWVDG
YLPEYELVGVLGRNPIRTAAF ...
0.322222222222 MENHVIIYQTKEPKLHTLDACIQDF
TKRYKLDNTWYLYKTANGKPI ...
0.382275132275 MILKNPDLDKQIEHSENIMLQKWPN
AAKIAYFQAYSNTHDSLKNLK ...
0.343169398907 MKNKANDFIHYLQFIDPKAALTVAS
YRNDLNQYIEYLEHESVSSLE ...
0.367260390161 MAQVIFHIDINAFYASAHLITDSSL
YGKPVVVCSNQRGSVVTTASY ...
0.362922705314 MLTHLTIDNFVLIHHISVDFSDRFN
VFTGETGAGKSLLVDALNFVS ...
0.406451612903 MSNKRGKYVKWLEPKNLLLLESWAR
KISNEQIAANIGIAPKTLYAW ...
[1575 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Adding aa length column


In [56]:
mergedTbl1["Calculated_length_aa"]=mergedTbl1['Seq_aa'].apply(returnLen)
mergedTbl1


Out[56]:
Feature ID Type Contig Start Stop Frame Strand Length (bp)
fig|1648.160.peg.1 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
39 530 3 + 492
fig|1648.160.peg.2 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
693 1601 3 + 909
fig|1648.160.peg.3 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
1594 2829 1 + 1236
fig|1648.160.peg.4 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
3666 2881 -3 - 786
fig|1648.160.peg.5 CDS NODE_100_length_4674_cov_
7.14051_ID_199 ...
4297 3758 -1 - 540
fig|1648.160.peg.6 CDS NODE_101_length_4672_cov_
9.29831_ID_201 ...
4524 3769 -3 - 756
fig|1648.160.peg.7 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
1744 830 -1 - 915
fig|1648.160.peg.8 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
2906 1728 -2 - 1179
fig|1648.160.peg.9 CDS NODE_102_length_4655_cov_
7.745_ID_203 ...
4549 2894 -1 - 1656
fig|1648.160.peg.10 CDS NODE_103_length_4645_cov_
7.86383_ID_205 ...
674 1138 2 + 465
Function Subsystem NCBI GI locus Seq_nt Calculated_length_nt
Putative hemagglutinin
/hemolysin-related ...
- none - atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
492
ABC transporter, ATP-
binding protein ...
- none - atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
909
ABC transporter, permease
protein (putative) ...
- none - atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
1236
L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
- none - atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
786
hypothetical protein - none - atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
540
COG1242: Predicted Fe-S
oxidoreductase ...
An Fe-S oxireductase
coupled with a ...
atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
756
Site-specific recombinase
XerD ...
- none - atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
915
DNA polymerase IV (EC
2.7.7.7) ...
DNA repair, bacterial atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
1179
DNA repair protein RecN DNA-replication; <br>DNA
repair, bacterial ...
atgcttactcatttaacgatagata
attttgttttgattcatcata ...
1656
Phage protein - none - atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
465
Gene_GC Seq_aa Calculated_length_aa
0.355691056911 MTVNEDVTLYAQWKKLIVVVDTYTV
VYDGNGNTSGNAPIDSHPYEK ...
163
0.324532453245 MKLEFNQINKFFDEHHVLKDISFEV
NSGQIFGYLGRNGAGKTTSIR ...
302
0.330906148867 MNKLKTVFKFEFLEMLRKRSVKVTT
LILCIAVLLITSVPTIQSVFM ...
411
0.40203562341 MKKLKLALIGPGFLNDIVAQAWVDG
YLPEYELVGVLGRNPIRTAAF ...
261
0.322222222222 MENHVIIYQTKEPKLHTLDACIQDF
TKRYKLDNTWYLYKTANGKPI ...
179
0.382275132275 MILKNPDLDKQIEHSENIMLQKWPN
AAKIAYFQAYSNTHDSLKNLK ...
251
0.343169398907 MKNKANDFIHYLQFIDPKAALTVAS
YRNDLNQYIEYLEHESVSSLE ...
304
0.367260390161 MAQVIFHIDINAFYASAHLITDSSL
YGKPVVVCSNQRGSVVTTASY ...
392
0.362922705314 MLTHLTIDNFVLIHHISVDFSDRFN
VFTGETGAGKSLLVDALNFVS ...
551
0.406451612903 MSNKRGKYVKWLEPKNLLLLESWAR
KISNEQIAANIGIAPKTLYAW ...
154
[1575 rows x 17 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [ ]:


In [ ]:


In [58]:
graphlab.canvas.set_target('ipynb')
mergedTbl1['Gene_GC'].show()



In [ ]:


In [59]:
from Bio import SeqIO


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-59-8f71aa6c445c> in <module>()
----> 1 from Bio import SeqIO

ImportError: No module named Bio

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [5]:
sf.head()


Out[5]:
First Name Last Name Country age
Bob Smith United States 24
Alice Williams Canada 23
Malcolm Jone England 22
Felix Brown USA 23
Alex Cooper Poland 23
Tod Campbell United States 22
Derek Ward Switzerland 25
[7 rows x 4 columns]


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [6]:
sf.tail()


Out[6]:
First Name Last Name Country age
Bob Smith United States 24
Alice Williams Canada 23
Malcolm Jone England 22
Felix Brown USA 23
Alex Cooper Poland 23
Tod Campbell United States 22
Derek Ward Switzerland 25
[7 rows x 4 columns]

GraphLab Canvas


In [7]:
#Any data structure in GraphLab Create
sf.show()
table1.show()


Canvas is accessible via web browser at the URL: http://localhost:60305/index.html
Opening Canvas in default web browser.
Canvas is accessible via web browser at the URL: http://localhost:60305/index.html
Opening Canvas in default web browser.

In [8]:
graphlab.canvas.set_target('ipynb')

In [11]:
sf['age'].show(view="Categorical")


Inspect some columns of dataset


In [9]:
sf["Country"]
table1["Length (bp)"]


Out[9]:
dtype: int
Rows: 1575
[492L, 909L, 1236L, 786L, 540L, 756L, 915L, 1179L, 1656L, 465L, 1374L, 1329L, 1029L, 819L, 603L, 534L, 651L, 2373L, 969L, 324L, 672L, 1323L, 150L, 450L, 1089L, 1047L, 1146L, 480L, 708L, 309L, 792L, 2433L, 306L, 273L, 249L, 498L, 723L, 1305L, 384L, 1395L, 1368L, 1182L, 1533L, 732L, 867L, 645L, 1011L, 1092L, 813L, 360L, 1092L, 903L, 681L, 123L, 804L, 615L, 399L, 135L, 1212L, 1851L, 1044L, 1281L, 900L, 1566L, 1110L, 537L, 1644L, 114L, 207L, 435L, 330L, 723L, 1155L, 837L, 789L, 1071L, 537L, 729L, 693L, 1539L, 159L, 192L, 330L, 657L, 1335L, 726L, 564L, 597L, 615L, 756L, 549L, 759L, 621L, 489L, 564L, 570L, 4092L, 729L, 999L, 996L, ... ]

In [13]:
sf['age']


Out[13]:
dtype: int
Rows: 7
[24L, 23L, 22L, 23L, 23L, 22L, 25L]

In [14]:
sf['age'].mean()


Out[14]:
23.142857142857146

In [15]:
sf['age'].max()


Out[15]:
25L

Create new columns in our SFrame


In [10]:
sf["Full Name"]=sf['First Name']+' '+sf['Last Name']


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-7a7c46285d97> in <module>()
      1 sf["Full Name"]=sf['First Name']+' '+sf['Last Name']
----> 2 table1["Seq_nt"]

C:\Anaconda\lib\site-packages\graphlab\data_structures\sframe.pyc in __getitem__(self, key)
   3536             return self._row_selector(key)
   3537         elif type(key) is str:
-> 3538             return self.select_column(key)
   3539         elif type(key) is type:
   3540             return self.select_columns([key])

C:\Anaconda\lib\site-packages\graphlab\data_structures\sframe.pyc in select_column(self, key)
   3141             raise TypeError("Invalid key type: must be str")
   3142         with cython_context():
-> 3143             return SArray(data=[], _proxy=self.__proxy__.select_column(key))
   3144 
   3145     def select_columns(self, keylist):

C:\Anaconda\lib\site-packages\graphlab\cython\context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Column name Seq_nt does not exist.

In [18]:
sf


Out[18]:
First Name Last Name Country age Full Name
Bob Smith United States 24 Bob Smith
Alice Williams Canada 23 Alice Williams
Malcolm Jone England 22 Malcolm Jone
Felix Brown USA 23 Felix Brown
Alex Cooper Poland 23 Alex Cooper
Tod Campbell United States 22 Tod Campbell
Derek Ward Switzerland 25 Derek Ward
[7 rows x 5 columns]


In [20]:
sf['age']+2


Out[20]:
dtype: int
Rows: 7
[26L, 25L, 24L, 25L, 25L, 24L, 27L]

In [21]:
sf['age']=sf['age']+2

In [22]:
sf


Out[22]:
First Name Last Name Country age Full Name
Bob Smith United States 26 Bob Smith
Alice Williams Canada 25 Alice Williams
Malcolm Jone England 24 Malcolm Jone
Felix Brown USA 25 Felix Brown
Alex Cooper Poland 25 Alex Cooper
Tod Campbell United States 24 Tod Campbell
Derek Ward Switzerland 27 Derek Ward
[7 rows x 5 columns]

Use the apply function to do an advanced transformation of our data


In [23]:
sfpp['Country']


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-23-af42b1b35ea2> in <module>()
----> 1 sfpp['Country']

NameError: name 'sfpp' is not defined

In [24]:
sf['Country']


Out[24]:
dtype: str
Rows: 7
['United States', 'Canada', 'England', 'USA', 'Poland', 'United States', 'Switzerland']

In [25]:
sf['Country'].show()



In [26]:
def transform_country(country):
    if country == 'USA':
        return 'United States'
    else:
        return country

In [27]:
transform_country('Brazil')


Out[27]:
'Brazil'

In [28]:
transform_country('USA')


Out[28]:
'United States'

In [29]:
sf['Country'].apply(transform.country)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-29-ebb4bbbf4ccf> in <module>()
----> 1 sf['Country'].apply(transform.country)

NameError: name 'transform' is not defined

In [30]:
sf['Country'].apply(transform_country)


Out[30]:
dtype: str
Rows: 7
['United States', 'Canada', 'England', 'United States', 'Poland', 'United States', 'Switzerland']

In [31]:
sf['Country']=sf['Country'].apply(transform_country)

In [32]:
sf


Out[32]:
First Name Last Name Country age Full Name
Bob Smith United States 26 Bob Smith
Alice Williams Canada 25 Alice Williams
Malcolm Jone England 24 Malcolm Jone
Felix Brown United States 25 Felix Brown
Alex Cooper Poland 25 Alex Cooper
Tod Campbell United States 24 Tod Campbell
Derek Ward Switzerland 27 Derek Ward
[7 rows x 5 columns]


In [ ]: