Fire up GraphLab Create



In [1]:

    
import graphlab
import pandas
from Bio import SeqIO









    



A newer version of GraphLab Create (v1.7.1) is available! Your current version is v1.6.1.

You can use pip to upgrade the graphlab-create package. For more information see https://dato.com/products/create/upgrade.

Load a tabular data set



In [2]:

    
sf=graphlab.SFrame('people-example.csv')
table1=graphlab.SFrame('table1.tsv')
hand_nt=open('1648.160_nt.fna', 'r')
hand_aa=open('1648.160.faa', 'r')

table1_nt=hand_nt.read()
table1_aa=hand_aa.read()

hand_nt.close()
hand_aa.close()









    



[INFO] This non-commercial license of GraphLab Create is assigned to andriy.sheremet@ucalgary.ca and will expire on October 26, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-11240 - Server binary: C:\Anaconda\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\Andriy\AppData\Local\Temp\graphlab_server_1448140378.log.0
[INFO] GraphLab Server Version: 1.6.1






    



PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\people-example.csv
PROGRESS: Parsing completed. Parsed 7 lines in 0.03125 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\people-example.csv
PROGRESS: Parsing completed. Parsed 7 lines in 0.015637 secs.
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\table1.tsv
PROGRESS: Parsing completed. Parsed 100 lines in 0.015624 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,long,long,long,str,long,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file D:\library\lessons\CMMB 637 Advanced Topics in Molecular Microbiology\genome_annot\table1.tsv
PROGRESS: Parsing completed. Parsed 1575 lines in 0.015625 secs.

Defining user functions



In [12]:

    
def returnLen(text):
    return len(text)

def gc(dna):
    "This function computes the GC ratio of a dna sequence"
    nbases=dna.count('n')+dna.count('N')
    gc_cont=float(dna.count('c')+dna.count('C')+dna.count('g')+dna.count('G'))/(len(dna)-nbases)
    return gc_cont

Processing FastA files



In [30]:

    
sf #we can view first few lines of the table
#table1_aa
#table1_nt
lst_aa=table1_aa.split('>')
lst_nt=table1_nt.split('>')
lst_nt=lst_nt[1:]#first (No. 0) list item is empty
#dict_nt
lst_aa=lst_aa[1:]# the same for amino acids

Converting gbk-file



In [3]:

    
SeqIO.convert("1648.160.gbk", "genbank", "1648.160_To_FastA.fna", "fasta") #output is not so friendly









    Out[3]:





723

Trying to improve

http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/



In [7]:

    
#hand_gb=open('1648.160.gbk','r')
#out_file = "1648.160_To_FastA_parced.fna"
#hand_out = open(out_file, "w")

#for seq_record in SeqIO.parse(hand_gb, "genbank") :
#    print "Dealing with GenBank record %s" % seq_record.id
#    hand_out.write(">%s %s\n%s\n" % (
#           seq_record.id,
#           seq_record.description,
#           seq_record.seq))

#hand_out.close()
#hand_gb.close()

This seems even better that all I've done before: there's one file containing all the data

tab separated text format



In [2]:

    
Banks969_Li_SFrame=graphlab.SFrame('6666666.156894.tsv')
Banks969_Li_SFrame
Banks969_Li_SFrame.show()









    



[INFO] This non-commercial license of GraphLab Create is assigned to andriy.sheremet@ucalgary.ca and will expire on October 26, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-10556 - Server binary: C:\Anaconda\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\Andriy\AppData\Local\Temp\graphlab_server_1448764274.log.0
[INFO] GraphLab Server Version: 1.6.1






    



PROGRESS: Finished parsing file D:\library\lessons\CMMB 549 Microbial Genetics\plasmidAnnot\plasmid_2015\6666666.156894.tsv
PROGRESS: Parsing completed. Parsed 100 lines in 0.02802 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,str,long,long,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file D:\library\lessons\CMMB 549 Microbial Genetics\plasmidAnnot\plasmid_2015\6666666.156894.tsv
PROGRESS: Parsing completed. Parsed 576 lines in 0.017013 secs.
Canvas is accessible via web browser at the URL: http://localhost:64154/index.html
Opening Canvas in default web browser.

Adding new columns: Nt_length, GC content, Aa_length

!Attention: if used this way, RNA data, specifically, GC content will be included in the calculations filter it out if it isn't needed, or use data obtained from convertion of fastA files



In [23]:

    
Banks969_Li_SFrame["Calculated_length_nt"]=Banks969_Li_SFrame['nucleotide_sequence'].apply(returnLen)
Banks969_Li_SFrame["Gene_GC"]=Banks969_Li_SFrame['nucleotide_sequence'].apply(gc)
Banks969_Li_SFrame["Calculated_length_aa"]=Banks969_Li_SFrame['aa_sequence'].apply(returnLen)
Banks969_Li_SFrame









    Out[23]:





    
        contig_id
        feature_id
        type
        location
        start
        stop
        strand
    
    
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        fig|1648.160.peg.1
        peg
        NODE_100_length_4674_cov_
7.14051_ID_199_39_530 ...
        39
        530
        +
    
    
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        fig|1648.160.peg.2
        peg
        NODE_100_length_4674_cov_
7.14051_ID_199_693_1601 ...
        693
        1601
        +
    
    
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        fig|1648.160.peg.3
        peg
        NODE_100_length_4674_cov_
7.14051_ID_199_1594_2829 ...
        1594
        2829
        +
    
    
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        fig|1648.160.peg.4
        peg
        NODE_100_length_4674_cov_
7.14051_ID_199_3666_2881 ...
        3666
        2881
        -
    
    
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        fig|1648.160.peg.5
        peg
        NODE_100_length_4674_cov_
7.14051_ID_199_4297_3758 ...
        4297
        3758
        -
    
    
        NODE_101_length_4672_cov_
9.29831_ID_201 ...
        fig|1648.160.peg.6
        peg
        NODE_101_length_4672_cov_
9.29831_ID_201_4524_3769 ...
        4524
        3769
        -
    
    
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        fig|1648.160.peg.7
        peg
        NODE_102_length_4655_cov_
7.745_ID_203_1744_830 ...
        1744
        830
        -
    
    
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        fig|1648.160.peg.8
        peg
        NODE_102_length_4655_cov_
7.745_ID_203_2906_1728 ...
        2906
        1728
        -
    
    
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        fig|1648.160.peg.9
        peg
        NODE_102_length_4655_cov_
7.745_ID_203_4549_2894 ...
        4549
        2894
        -
    
    
        NODE_103_length_4645_cov_
7.86383_ID_205 ...
        fig|1648.160.peg.10
        peg
        NODE_103_length_4645_cov_
7.86383_ID_205_674_1138 ...
        674
        1138
        +
    


    
        function
        aliases
        figfam
        evidence_codes
        nucleotide_sequence
    
    
        Putative hemagglutinin
/hemolysin-related ...
        
        
        
        atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
    
    
        ABC transporter, ATP-
binding protein ...
        
        FIG00744535
        ff
        atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
    
    
        ABC transporter, permease
protein (putative) ...
        
        
        
        atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
    
    
        L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
        
        FIG01308195
        ff
        atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
    
    
        hypothetical protein
        
        
        
        atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
    
    
        COG1242: Predicted Fe-S
oxidoreductase ...
        
        FIG00001867
        isu;An_Fe-S_oxireductase_
coupled_with_a_methyl ...
        atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
    
    
        Site-specific recombinase
XerD ...
        
        
        
        atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
    
    
        DNA polymerase IV (EC
2.7.7.7) ...
        
        FIG00023943
        icw(1);DNA_repair,_bacter
ial ...
        atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
    
    
        DNA repair protein RecN
        
        FIG00000317
        isu;DNA-replication
isu;DNA_repair,_bacte ...
        atgcttactcatttaacgatagata
attttgttttgattcatcata ...
    
    
        Phage protein
        
        
        
        atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
    


    
        aa_sequence
        Calculated_length_nt
        Gene_GC
        Calculated_length_aa
    
    
        MTVNEDVTLYAQWKKLIVVVDTYTV
VYDGNGNTSGNAPIDSHPYEK ...
        492
        0.355691056911
        163
    
    
        MKLEFNQINKFFDEHHVLKDISFEV
NSGQIFGYLGRNGAGKTTSIR ...
        909
        0.324532453245
        302
    
    
        MNKLKTVFKFEFLEMLRKRSVKVTT
LILCIAVLLITSVPTIQSVFM ...
        1236
        0.330906148867
        411
    
    
        MKKLKLALIGPGFLNDIVAQAWVDG
YLPEYELVGVLGRNPIRTAAF ...
        786
        0.40203562341
        261
    
    
        MENHVIIYQTKEPKLHTLDACIQDF
TKRYKLDNTWYLYKTANGKPI ...
        540
        0.322222222222
        179
    
    
        MILKNPDLDKQIEHSENIMLQKWPN
AAKIAYFQAYSNTHDSLKNLK ...
        756
        0.382275132275
        251
    
    
        MKNKANDFIHYLQFIDPKAALTVAS
YRNDLNQYIEYLEHESVSSLE ...
        915
        0.343169398907
        304
    
    
        MAQVIFHIDINAFYASAHLITDSSL
YGKPVVVCSNQRGSVVTTASY ...
        1179
        0.367260390161
        392
    
    
        MLTHLTIDNFVLIHHISVDFSDRFN
VFTGETGAGKSLLVDALNFVS ...
        1656
        0.362922705314
        551
    
    
        MSNKRGKYVKWLEPKNLLLLESWAR
KISNEQIAANIGIAPKTLYAW ...
        465
        0.406451612903
        154
    

[1575 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Separating RNA records



In [25]:

    
rna_Banks969_Li_SFrame=Banks969_Li_SFrame.filter_by('rna','type')
rna_Banks969_Li_SFrame









    Out[25]:





    
        contig_id
        feature_id
        type
        location
        start
        stop
        strand
    
    
        NODE_11_length_9738_cov_8
.40958_ID_21 ...
        fig|1648.160.rna.1
        rna
        NODE_11_length_9738_cov_8
.40958_ID_21_9726_9656 ...
        9726
        9656
        -
    
    
        NODE_155_length_3642_cov_
7.81071_ID_309 ...
        fig|1648.160.rna.2
        rna
        NODE_155_length_3642_cov_
7.81071_ID_309_3325_3395 ...
        3325
        3395
        +
    
    
        NODE_160_length_3510_cov_
8.02605_ID_319 ...
        fig|1648.160.rna.3
        rna
        NODE_160_length_3510_cov_
8.02605_ID_319_2002_1930 ...
        2002
        1930
        -
    
    
        NODE_183_length_3178_cov_
10.617_ID_365 ...
        fig|1648.160.rna.4
        rna
        NODE_183_length_3178_cov_
10.617_ID_365_2494_2422 ...
        2494
        2422
        -
    
    
        NODE_183_length_3178_cov_
10.617_ID_365 ...
        fig|1648.160.rna.5
        rna
        NODE_183_length_3178_cov_
10.617_ID_365_2577_2505 ...
        2577
        2505
        -
    
    
        NODE_183_length_3178_cov_
10.617_ID_365 ...
        fig|1648.160.rna.6
        rna
        NODE_183_length_3178_cov_
10.617_ID_365_2827_2755 ...
        2827
        2755
        -
    
    
        NODE_183_length_3178_cov_
10.617_ID_365 ...
        fig|1648.160.rna.7
        rna
        NODE_183_length_3178_cov_
10.617_ID_365_2910_2838 ...
        2910
        2838
        -
    
    
        NODE_184_length_3174_cov_
6.24174_ID_367 ...
        fig|1648.160.rna.8
        rna
        NODE_184_length_3174_cov_
6.24174_ID_367_3041_2969 ...
        3041
        2969
        -
    
    
        NODE_197_length_2970_cov_
11.1692_ID_393 ...
        fig|1648.160.rna.9
        rna
        NODE_197_length_2970_cov_
11.1692_ID_393_1043_971 ...
        1043
        971
        -
    
    
        NODE_197_length_2970_cov_
11.1692_ID_393 ...
        fig|1648.160.rna.10
        rna
        NODE_197_length_2970_cov_
11.1692_ID_393_1143_1070 ...
        1143
        1070
        -
    


    
        function
        aliases
        figfam
        evidence_codes
        nucleotide_sequence
        aa_sequence
        Calculated_length_nt
    
    
        tRNA-Thr-GGT
        
        
        
        gccgccatagttcaatggtaaaata
gagcaatggtaatgctcagtt ...
        
        71
    
    
        tRNA-Trp-CCA
        
        
        
        aggggtatagttcaatggtagagca
acggtctccaaaaccgttaat ...
        
        71
    
    
        tRNA-Thr-CGT
        
        
        
        gccggcttagctcaattggtagagc
aactgaatcgtaatcagtagg ...
        
        73
    
    
        tRNA-Glu-TTC
        
        
        
        ggcccgttggagaaacggttaactc
acatgcctttcacgcatgcat ...
        
        73
    
    
        tRNA-Asn-GTT
        
        
        
        gcttgtttagctcagtcggtagagc
aactggctgttaaccagtggg ...
        
        73
    
    
        tRNA-Glu-TTC
        
        
        
        ggcccgttggagaaacggttaactc
acatgcctttcacgcatgcat ...
        
        73
    
    
        tRNA-Asn-GTT
        
        
        
        gcttgtttagctcagtcggtagagc
aactggctgttaaccagtggg ...
        
        73
    
    
        tRNA-Lys-CTT
        
        
        
        gctccgttagctcagtcggtagagc
aactgactcttaatcagtggg ...
        
        73
    
    
        tRNA-Ala-TGC
        
        
        
        ggggttttagctcagctgggagagc
gcctgccttgcacgcaggagg ...
        
        73
    
    
        tRNA-Ile-GAT
        
        
        
        gggcctgtagctcagttggttagag
cgcacccctgataagggtgag ...
        
        74
    


    
        Gene_GC
        Calculated_length_aa
    
    
        0.492957746479
        0
    
    
        0.507042253521
        0
    
    
        0.520547945205
        0
    
    
        0.547945205479
        0
    
    
        0.547945205479
        0
    
    
        0.547945205479
        0
    
    
        0.547945205479
        0
    
    
        0.575342465753
        0
    
    
        0.602739726027
        0
    
    
        0.581081081081
        0
    

[55 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Separating coding records



In [26]:

    
coding_Banks969_Li_SFrame=Banks969_Li_SFrame.filter_by('peg','type')
coding_Banks969_Li_SFrame.show()









    



Canvas is accessible via web browser at the URL: http://localhost:64100/index.html
Opening Canvas in default web browser.

Exporting Data



In [30]:

    
gene_Banks969_Li_gc_SFrame=coding_Banks969_Li_SFrame.select_columns(['feature_id','Calculated_length_nt','Gene_GC'])
gene_Banks969_Li_gc_SFrame









    Out[30]:





    
        feature_id
        Calculated_length_nt
        Gene_GC
    
    
        fig|1648.160.peg.1
        492
        0.355691056911
    
    
        fig|1648.160.peg.2
        909
        0.324532453245
    
    
        fig|1648.160.peg.3
        1236
        0.330906148867
    
    
        fig|1648.160.peg.4
        786
        0.40203562341
    
    
        fig|1648.160.peg.5
        540
        0.322222222222
    
    
        fig|1648.160.peg.6
        756
        0.382275132275
    
    
        fig|1648.160.peg.7
        915
        0.343169398907
    
    
        fig|1648.160.peg.8
        1179
        0.367260390161
    
    
        fig|1648.160.peg.9
        1656
        0.362922705314
    
    
        fig|1648.160.peg.10
        465
        0.406451612903
    

[1520 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exporting to csv



In [31]:

    
gene_Banks969_Li_gc_SFrame.export_csv('gene_Banks969_Li_gc.tsv','\t')

Exporting to dictionary/fastA file



In [33]:

    
gene_Banks969_Li_gc_SFrame[0]









    Out[33]:





{'Calculated_length_nt': 492L,
 'Gene_GC': 0.3556910569105691,
 'feature_id': 'fig|1648.160.peg.1'}

Forming a record of the following pattern: ">ID| function| [Strain]



In [56]:

    
coding_Banks969_Li_aa_dict={}
for item in coding_Banks969_Li_SFrame:
    coding_Banks969_Li_aa_dict['>'+item['feature_id']+'| '+item['function']+'| '+'[Erysipelothrix rhusiopathiae Banks969-Li]']=item['aa_sequence']
    #print item['function']

for i in range(len(coding_Banks969_Li_aa_dict)):
    print coding_Banks969_Li_aa_dict[][i]









    



  File "<ipython-input-56-aa03a4ea185d>", line 7
    print coding_Banks969_Li_aa_dict[][i]
                                     ^
SyntaxError: invalid syntax

Writing to a faa file



In [60]:

    
out_file='Banks969_Li_annotated.faa'
hand_out=open(out_file,'w')

for item in coding_Banks969_Li_aa_dict:
   hand_out.write(item+'\n'+coding_Banks969_Li_aa_dict[item])
hand_out.close()
print len(coding_Banks969_Li_aa_dict), type(coding_Banks969_Li_aa_dict)









    



1520 <type 'dict'>



In [61]:

    
out_file='Banks969_Li_annotated_1.faa'
hand_out=open(out_file,'w')
coding_Banks969_Li_aa_headers=coding_Banks969_Li_aa_dict.keys()
coding_Banks969_Li_aa_sequences=coding_Banks969_Li_aa_dict.values()
for i in range(len(coding_Banks969_Li_aa_dict)):
    hand_out.write(coding_Banks969_Li_aa_headers[i]+'\n'+coding_Banks969_Li_aa_sequences[i])
hand_out.close()
print len(coding_Banks969_Li_aa_dict), type(coding_Banks969_Li_aa_dict)
print len(coding_Banks969_Li_aa_headers), type(coding_Banks969_Li_aa_headers)
print len(coding_Banks969_Li_aa_sequences), type(coding_Banks969_Li_aa_sequences)









    



1520 <type 'dict'>
1520 <type 'list'>
1520 <type 'list'>

Correcting for the newline characters between the entries but not after the last line



In [65]:

    
out_file='Banks969_Li_annotated_optimized.faa'
hand_out=open(out_file,'w')
coding_Banks969_Li_aa_headers=coding_Banks969_Li_aa_dict.keys()
coding_Banks969_Li_aa_sequences=coding_Banks969_Li_aa_dict.values()
for i in range(len(coding_Banks969_Li_aa_dict)):
    if (i<(len(coding_Banks969_Li_aa_dict)-1)):
        hand_out.write(coding_Banks969_Li_aa_headers[i]+'\n'+coding_Banks969_Li_aa_sequences[i]+'\n')
    else:
        hand_out.write(coding_Banks969_Li_aa_headers[i]+'\n'+coding_Banks969_Li_aa_sequences[i])
hand_out.close()
print len(coding_Banks969_Li_aa_dict), type(coding_Banks969_Li_aa_dict)
print len(coding_Banks969_Li_aa_headers), type(coding_Banks969_Li_aa_headers)
print len(coding_Banks969_Li_aa_sequences), type(coding_Banks969_Li_aa_sequences)









    



1520 <type 'dict'>
1520 <type 'list'>
1520 <type 'list'>



In [ ]:

Create a dictionary containing the data



In [31]:

    
dict_nt={}

for item in lst_nt:
    lst_item=item.split('\n')
    dict_nt[lst_item[0].strip()]=''.join(lst_item[1:]).strip()
#dict_nt ##looks ok but too large to display



In [33]:

    
dict_aa={}
for item in lst_aa:
    lst_item=item.split('\n')
    dict_aa[lst_item[0].strip()]=''.join(lst_item[1:]).strip()
#dict_aa ###too large to display

Creating index for table1 to insert nucleotides: indexed_df = df.set_index(['A', 'B'])



In [13]:

    
#table1_addedColumn=table1
#table1_addedColumn['Seq_nt']=table1_addedColumn['NCBI GI']#just creating an empty column
#indexed_table1=table1.to_dataframe()#converting to Dataframe and indexing
#indexed_table1.set_index('Feature ID')



In [48]:

    
sf=graphlab.SFrame({'Feature ID':dict_nt.keys(),'Seq_nt':dict_nt.values()})
#sf['Feature ID']
aa_SFrame=graphlab.SFrame({'Feature ID':dict_aa.keys(),'Seq_aa':dict_aa.values()})
aa_SFrame









    Out[48]:





    
        Feature ID
        Seq_aa
    
    
        fig|1648.160.peg.1398
        MKLTDKQLKIIETYGADTPESFLMV
FPYRYERLEAKPRDAWRPGDT ...
    
    
        fig|1648.160.peg.1399
        MTKWDLLDSMKIPYFNKDLINNAFI
HSSYVNEAEELLEDNERLEFM ...
    
    
        fig|1648.160.peg.1428
        MVIHFDLAQKEAKIKELESEMVEDG
FWDDHRHSSQHIQKLNQLKKV ...
    
    
        fig|1648.160.peg.1429
        MAGFLSNLFSGDRKILNEIEKIAHE
IDALADETRALSDEALKEKTN ...
    
    
        fig|1648.160.peg.1394
        MPKLDILNLEGKSLRELELNEAVFG
IEPNNQTIFEAVVMQQASLRQ ...
    
    
        fig|1648.160.peg.1395
        MKGLLGRKLGMTQVFTTDGKLIPVS
VVEVLPNVVLQKKTMESDNYE ...
    
    
        fig|1648.160.peg.1396
        MAKNFIRIRLKAYEHRTIDSAAQKI
VQAANDHGAQKVVGPVPLPTE ...
    
    
        fig|1648.160.peg.1397
        MAAKQAQDICEDKDIHVIETKTIPQ
GLSACVMFNPDVDVEDNLTEM ...
    
    
        fig|1648.160.peg.1422
        MVPYESTVFTNYQTIARLGLLNTYT
ALIVPSLASVFYIFYLKGYLT ...
    
    
        fig|1648.160.peg.1391
        MSRSLKKGPFCDLHLMNKVEKLNAE
SKKEVIKTWSRRSTIFPQFVE ...
    

[1520 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

Adding a column and inserting nucleotide data



In [49]:

    
table1_merge=table1
#table1_merge['Feature ID']



In [50]:

    
sf









    Out[50]:





    
        Feature ID
        Seq_nt
    
    
        fig|1648.160.peg.1398
        atgaaactcacggataaacaattaa
aaattattgaaacttatggtg ...
    
    
        fig|1648.160.peg.1399
        atgacaaaatgggatttattggatt
ctatgaagattccatacttta ...
    
    
        fig|1648.160.peg.1428
        ttggtgattcactttgacttagctc
aaaaagaagctaaaatcaaag ...
    
    
        fig|1648.160.peg.1429
        atggctggatttttaagtaatcttt
ttagtggcgatcgtaagattt ...
    
    
        fig|1648.160.peg.1394
        atgcctaagttagatattttaaacc
ttgaaggcaagagtcttcgtg ...
    
    
        fig|1648.160.peg.1395
        atgaaaggattactaggacgtaaat
taggaatgacacaagtcttca ...
    
    
        fig|1648.160.peg.1396
        atggcaaagaattttattcgtattc
gtttaaaagcttatgaacacc ...
    
    
        fig|1648.160.peg.1397
        atggctgcaaaacaagctcaagata
tttgtgaagacaaagatatac ...
    
    
        fig|1648.160.peg.1390
        atggaagtaaaagctatagttaaaa
cagttcgtgttactccacgta ...
    
    
        fig|1648.160.peg.1391
        atgagtcgtagtttaaaaaaaggac
cattttgtgatcttcatttaa ...
    

[1520 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Trying to merge



In [51]:

    
mergedTbl1=table1.join(sf, how='left')
mergedTbl1.show()









    



Canvas is updated and available in a tab in the default browser.

Defining user functions



In [52]:



In [53]:

    
mergedTbl1["Calculated_length_nt"]=mergedTbl1['Seq_nt'].apply(returnLen)
mergedTbl1









    Out[53]:





    
        Feature ID
        Type
        Contig
        Start
        Stop
        Frame
        Strand
        Length (bp)
    
    
        fig|1648.160.peg.1
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        39
        530
        3
        +
        492
    
    
        fig|1648.160.peg.2
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        693
        1601
        3
        +
        909
    
    
        fig|1648.160.peg.3
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        1594
        2829
        1
        +
        1236
    
    
        fig|1648.160.peg.4
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        3666
        2881
        -3
        -
        786
    
    
        fig|1648.160.peg.5
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        4297
        3758
        -1
        -
        540
    
    
        fig|1648.160.peg.6
        CDS
        NODE_101_length_4672_cov_
9.29831_ID_201 ...
        4524
        3769
        -3
        -
        756
    
    
        fig|1648.160.peg.7
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        1744
        830
        -1
        -
        915
    
    
        fig|1648.160.peg.8
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        2906
        1728
        -2
        -
        1179
    
    
        fig|1648.160.peg.9
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        4549
        2894
        -1
        -
        1656
    
    
        fig|1648.160.peg.10
        CDS
        NODE_103_length_4645_cov_
7.86383_ID_205 ...
        674
        1138
        2
        +
        465
    


    
        Function
        Subsystem
        NCBI GI
        locus
        Seq_nt
        Calculated_length_nt
    
    
        Putative hemagglutinin
/hemolysin-related ...
        - none -
        
        
        atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
        492
    
    
        ABC transporter, ATP-
binding protein ...
        - none -
        
        
        atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
        909
    
    
        ABC transporter, permease
protein (putative) ...
        - none -
        
        
        atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
        1236
    
    
        L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
        - none -
        
        
        atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
        786
    
    
        hypothetical protein
        - none -
        
        
        atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
        540
    
    
        COG1242: Predicted Fe-S
oxidoreductase ...
        An Fe-S oxireductase
coupled with a ...
        
        
        atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
        756
    
    
        Site-specific recombinase
XerD ...
        - none -
        
        
        atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
        915
    
    
        DNA polymerase IV (EC
2.7.7.7) ...
        DNA repair, bacterial
        
        
        atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
        1179
    
    
        DNA repair protein RecN
        DNA-replication; <br>DNA
repair, bacterial ...
        
        
        atgcttactcatttaacgatagata
attttgttttgattcatcata ...
        1656
    
    
        Phage protein
        - none -
        
        
        atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
        465
    

[1575 rows x 14 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [ ]:

Inserting GC content data for every gene



In [54]:

    
mergedTbl1["Gene_GC"]=mergedTbl1['Seq_nt'].apply(gc)
mergedTbl1









    Out[54]:





    
        Feature ID
        Type
        Contig
        Start
        Stop
        Frame
        Strand
        Length (bp)
    
    
        fig|1648.160.peg.1
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        39
        530
        3
        +
        492
    
    
        fig|1648.160.peg.2
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        693
        1601
        3
        +
        909
    
    
        fig|1648.160.peg.3
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        1594
        2829
        1
        +
        1236
    
    
        fig|1648.160.peg.4
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        3666
        2881
        -3
        -
        786
    
    
        fig|1648.160.peg.5
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        4297
        3758
        -1
        -
        540
    
    
        fig|1648.160.peg.6
        CDS
        NODE_101_length_4672_cov_
9.29831_ID_201 ...
        4524
        3769
        -3
        -
        756
    
    
        fig|1648.160.peg.7
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        1744
        830
        -1
        -
        915
    
    
        fig|1648.160.peg.8
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        2906
        1728
        -2
        -
        1179
    
    
        fig|1648.160.peg.9
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        4549
        2894
        -1
        -
        1656
    
    
        fig|1648.160.peg.10
        CDS
        NODE_103_length_4645_cov_
7.86383_ID_205 ...
        674
        1138
        2
        +
        465
    


    
        Function
        Subsystem
        NCBI GI
        locus
        Seq_nt
        Calculated_length_nt
    
    
        Putative hemagglutinin
/hemolysin-related ...
        - none -
        
        
        atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
        492
    
    
        ABC transporter, ATP-
binding protein ...
        - none -
        
        
        atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
        909
    
    
        ABC transporter, permease
protein (putative) ...
        - none -
        
        
        atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
        1236
    
    
        L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
        - none -
        
        
        atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
        786
    
    
        hypothetical protein
        - none -
        
        
        atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
        540
    
    
        COG1242: Predicted Fe-S
oxidoreductase ...
        An Fe-S oxireductase
coupled with a ...
        
        
        atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
        756
    
    
        Site-specific recombinase
XerD ...
        - none -
        
        
        atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
        915
    
    
        DNA polymerase IV (EC
2.7.7.7) ...
        DNA repair, bacterial
        
        
        atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
        1179
    
    
        DNA repair protein RecN
        DNA-replication; <br>DNA
repair, bacterial ...
        
        
        atgcttactcatttaacgatagata
attttgttttgattcatcata ...
        1656
    
    
        Phage protein
        - none -
        
        
        atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
        465
    


    
        Gene_GC
    
    
        0.355691056911
    
    
        0.324532453245
    
    
        0.330906148867
    
    
        0.40203562341
    
    
        0.322222222222
    
    
        0.382275132275
    
    
        0.343169398907
    
    
        0.367260390161
    
    
        0.362922705314
    
    
        0.406451612903
    

[1575 rows x 15 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Adding Amino Acid records

Merging two SFrames



In [55]:

    
mergedTbl1=mergedTbl1.join(aa_SFrame, how='left')
mergedTbl1 #OK









    Out[55]:





    
        Feature ID
        Type
        Contig
        Start
        Stop
        Frame
        Strand
        Length (bp)
    
    
        fig|1648.160.peg.1
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        39
        530
        3
        +
        492
    
    
        fig|1648.160.peg.2
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        693
        1601
        3
        +
        909
    
    
        fig|1648.160.peg.3
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        1594
        2829
        1
        +
        1236
    
    
        fig|1648.160.peg.4
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        3666
        2881
        -3
        -
        786
    
    
        fig|1648.160.peg.5
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        4297
        3758
        -1
        -
        540
    
    
        fig|1648.160.peg.6
        CDS
        NODE_101_length_4672_cov_
9.29831_ID_201 ...
        4524
        3769
        -3
        -
        756
    
    
        fig|1648.160.peg.7
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        1744
        830
        -1
        -
        915
    
    
        fig|1648.160.peg.8
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        2906
        1728
        -2
        -
        1179
    
    
        fig|1648.160.peg.9
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        4549
        2894
        -1
        -
        1656
    
    
        fig|1648.160.peg.10
        CDS
        NODE_103_length_4645_cov_
7.86383_ID_205 ...
        674
        1138
        2
        +
        465
    


    
        Function
        Subsystem
        NCBI GI
        locus
        Seq_nt
        Calculated_length_nt
    
    
        Putative hemagglutinin
/hemolysin-related ...
        - none -
        
        
        atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
        492
    
    
        ABC transporter, ATP-
binding protein ...
        - none -
        
        
        atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
        909
    
    
        ABC transporter, permease
protein (putative) ...
        - none -
        
        
        atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
        1236
    
    
        L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
        - none -
        
        
        atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
        786
    
    
        hypothetical protein
        - none -
        
        
        atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
        540
    
    
        COG1242: Predicted Fe-S
oxidoreductase ...
        An Fe-S oxireductase
coupled with a ...
        
        
        atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
        756
    
    
        Site-specific recombinase
XerD ...
        - none -
        
        
        atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
        915
    
    
        DNA polymerase IV (EC
2.7.7.7) ...
        DNA repair, bacterial
        
        
        atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
        1179
    
    
        DNA repair protein RecN
        DNA-replication; <br>DNA
repair, bacterial ...
        
        
        atgcttactcatttaacgatagata
attttgttttgattcatcata ...
        1656
    
    
        Phage protein
        - none -
        
        
        atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
        465
    


    
        Gene_GC
        Seq_aa
    
    
        0.355691056911
        MTVNEDVTLYAQWKKLIVVVDTYTV
VYDGNGNTSGNAPIDSHPYEK ...
    
    
        0.324532453245
        MKLEFNQINKFFDEHHVLKDISFEV
NSGQIFGYLGRNGAGKTTSIR ...
    
    
        0.330906148867
        MNKLKTVFKFEFLEMLRKRSVKVTT
LILCIAVLLITSVPTIQSVFM ...
    
    
        0.40203562341
        MKKLKLALIGPGFLNDIVAQAWVDG
YLPEYELVGVLGRNPIRTAAF ...
    
    
        0.322222222222
        MENHVIIYQTKEPKLHTLDACIQDF
TKRYKLDNTWYLYKTANGKPI ...
    
    
        0.382275132275
        MILKNPDLDKQIEHSENIMLQKWPN
AAKIAYFQAYSNTHDSLKNLK ...
    
    
        0.343169398907
        MKNKANDFIHYLQFIDPKAALTVAS
YRNDLNQYIEYLEHESVSSLE ...
    
    
        0.367260390161
        MAQVIFHIDINAFYASAHLITDSSL
YGKPVVVCSNQRGSVVTTASY ...
    
    
        0.362922705314
        MLTHLTIDNFVLIHHISVDFSDRFN
VFTGETGAGKSLLVDALNFVS ...
    
    
        0.406451612903
        MSNKRGKYVKWLEPKNLLLLESWAR
KISNEQIAANIGIAPKTLYAW ...
    

[1575 rows x 16 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Adding aa length column



In [56]:

    
mergedTbl1["Calculated_length_aa"]=mergedTbl1['Seq_aa'].apply(returnLen)
mergedTbl1









    Out[56]:





    
        Feature ID
        Type
        Contig
        Start
        Stop
        Frame
        Strand
        Length (bp)
    
    
        fig|1648.160.peg.1
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        39
        530
        3
        +
        492
    
    
        fig|1648.160.peg.2
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        693
        1601
        3
        +
        909
    
    
        fig|1648.160.peg.3
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        1594
        2829
        1
        +
        1236
    
    
        fig|1648.160.peg.4
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        3666
        2881
        -3
        -
        786
    
    
        fig|1648.160.peg.5
        CDS
        NODE_100_length_4674_cov_
7.14051_ID_199 ...
        4297
        3758
        -1
        -
        540
    
    
        fig|1648.160.peg.6
        CDS
        NODE_101_length_4672_cov_
9.29831_ID_201 ...
        4524
        3769
        -3
        -
        756
    
    
        fig|1648.160.peg.7
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        1744
        830
        -1
        -
        915
    
    
        fig|1648.160.peg.8
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        2906
        1728
        -2
        -
        1179
    
    
        fig|1648.160.peg.9
        CDS
        NODE_102_length_4655_cov_
7.745_ID_203 ...
        4549
        2894
        -1
        -
        1656
    
    
        fig|1648.160.peg.10
        CDS
        NODE_103_length_4645_cov_
7.86383_ID_205 ...
        674
        1138
        2
        +
        465
    


    
        Function
        Subsystem
        NCBI GI
        locus
        Seq_nt
        Calculated_length_nt
    
    
        Putative hemagglutinin
/hemolysin-related ...
        - none -
        
        
        atgactgttaacgaggatgttacac
tttatgcacaatggaaaaaat ...
        492
    
    
        ABC transporter, ATP-
binding protein ...
        - none -
        
        
        atgaaattagaatttaatcaaatca
ataagttttttgacgaacatc ...
        909
    
    
        ABC transporter, permease
protein (putative) ...
        - none -
        
        
        atgaataaactcaaaacggttttta
aatttgaatttttagaaatgt ...
        1236
    
    
        L-Aspartate dehydrogenase
(EC 1.4.1.21) homolog ...
        - none -
        
        
        atgaaaaaattaaaattagcgttaa
ttgggccaggatttctcaacg ...
        786
    
    
        hypothetical protein
        - none -
        
        
        atggaaaatcacgtaattatctatc
aaacaaaagagcccaaacttc ...
        540
    
    
        COG1242: Predicted Fe-S
oxidoreductase ...
        An Fe-S oxireductase
coupled with a ...
        
        
        atgattttaaaaaatccagacttag
ataagcaaatcgagcacagtg ...
        756
    
    
        Site-specific recombinase
XerD ...
        - none -
        
        
        atgaaaaacaaagctaatgatttta
ttcactacctccaatttattg ...
        915
    
    
        DNA polymerase IV (EC
2.7.7.7) ...
        DNA repair, bacterial
        
        
        atggcacaagtaatatttcatatag
atattaacgcgttttatgcaa ...
        1179
    
    
        DNA repair protein RecN
        DNA-replication; <br>DNA
repair, bacterial ...
        
        
        atgcttactcatttaacgatagata
attttgttttgattcatcata ...
        1656
    
    
        Phage protein
        - none -
        
        
        atgtcaaataagagaggtaaatacg
taaaatggcttgaacctaaaa ...
        465
    


    
        Gene_GC
        Seq_aa
        Calculated_length_aa
    
    
        0.355691056911
        MTVNEDVTLYAQWKKLIVVVDTYTV
VYDGNGNTSGNAPIDSHPYEK ...
        163
    
    
        0.324532453245
        MKLEFNQINKFFDEHHVLKDISFEV
NSGQIFGYLGRNGAGKTTSIR ...
        302
    
    
        0.330906148867
        MNKLKTVFKFEFLEMLRKRSVKVTT
LILCIAVLLITSVPTIQSVFM ...
        411
    
    
        0.40203562341
        MKKLKLALIGPGFLNDIVAQAWVDG
YLPEYELVGVLGRNPIRTAAF ...
        261
    
    
        0.322222222222
        MENHVIIYQTKEPKLHTLDACIQDF
TKRYKLDNTWYLYKTANGKPI ...
        179
    
    
        0.382275132275
        MILKNPDLDKQIEHSENIMLQKWPN
AAKIAYFQAYSNTHDSLKNLK ...
        251
    
    
        0.343169398907
        MKNKANDFIHYLQFIDPKAALTVAS
YRNDLNQYIEYLEHESVSSLE ...
        304
    
    
        0.367260390161
        MAQVIFHIDINAFYASAHLITDSSL
YGKPVVVCSNQRGSVVTTASY ...
        392
    
    
        0.362922705314
        MLTHLTIDNFVLIHHISVDFSDRFN
VFTGETGAGKSLLVDALNFVS ...
        551
    
    
        0.406451612903
        MSNKRGKYVKWLEPKNLLLLESWAR
KISNEQIAANIGIAPKTLYAW ...
        154
    

[1575 rows x 17 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [ ]:



In [ ]:



In [58]:

    
graphlab.canvas.set_target('ipynb')
mergedTbl1['Gene_GC'].show()



In [ ]:



In [59]:

    
from Bio import SeqIO









    



---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-59-8f71aa6c445c> in <module>()
----> 1 from Bio import SeqIO

ImportError: No module named Bio



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [5]:

    
sf.head()









    Out[5]:





    
        First Name
        Last Name
        Country
        age
    
    
        Bob
        Smith
        United States
        24
    
    
        Alice
        Williams
        Canada
        23
    
    
        Malcolm
        Jone
        England
        22
    
    
        Felix
        Brown
        USA
        23
    
    
        Alex
        Cooper
        Poland
        23
    
    
        Tod
        Campbell
        United States
        22
    
    
        Derek
        Ward
        Switzerland
        25
    

[7 rows x 4 columns]



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [6]:

    
sf.tail()









    Out[6]:





    
        First Name
        Last Name
        Country
        age
    
    
        Bob
        Smith
        United States
        24
    
    
        Alice
        Williams
        Canada
        23
    
    
        Malcolm
        Jone
        England
        22
    
    
        Felix
        Brown
        USA
        23
    
    
        Alex
        Cooper
        Poland
        23
    
    
        Tod
        Campbell
        United States
        22
    
    
        Derek
        Ward
        Switzerland
        25
    

[7 rows x 4 columns]

GraphLab Canvas



In [7]:

    
#Any data structure in GraphLab Create
sf.show()
table1.show()









    



Canvas is accessible via web browser at the URL: http://localhost:60305/index.html
Opening Canvas in default web browser.
Canvas is accessible via web browser at the URL: http://localhost:60305/index.html
Opening Canvas in default web browser.



In [8]:

    
graphlab.canvas.set_target('ipynb')



In [11]:

    
sf['age'].show(view="Categorical")

Inspect some columns of dataset



In [9]:

    
sf["Country"]
table1["Length (bp)"]









    Out[9]:





dtype: int
Rows: 1575
[492L, 909L, 1236L, 786L, 540L, 756L, 915L, 1179L, 1656L, 465L, 1374L, 1329L, 1029L, 819L, 603L, 534L, 651L, 2373L, 969L, 324L, 672L, 1323L, 150L, 450L, 1089L, 1047L, 1146L, 480L, 708L, 309L, 792L, 2433L, 306L, 273L, 249L, 498L, 723L, 1305L, 384L, 1395L, 1368L, 1182L, 1533L, 732L, 867L, 645L, 1011L, 1092L, 813L, 360L, 1092L, 903L, 681L, 123L, 804L, 615L, 399L, 135L, 1212L, 1851L, 1044L, 1281L, 900L, 1566L, 1110L, 537L, 1644L, 114L, 207L, 435L, 330L, 723L, 1155L, 837L, 789L, 1071L, 537L, 729L, 693L, 1539L, 159L, 192L, 330L, 657L, 1335L, 726L, 564L, 597L, 615L, 756L, 549L, 759L, 621L, 489L, 564L, 570L, 4092L, 729L, 999L, 996L, ... ]



In [13]:

    
sf['age']









    Out[13]:





dtype: int
Rows: 7
[24L, 23L, 22L, 23L, 23L, 22L, 25L]



In [14]:

    
sf['age'].mean()









    Out[14]:





23.142857142857146



In [15]:

    
sf['age'].max()









    Out[15]:





25L

Create new columns in our SFrame



In [10]:

    
sf["Full Name"]=sf['First Name']+' '+sf['Last Name']









    



---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-7a7c46285d97> in <module>()
      1 sf["Full Name"]=sf['First Name']+' '+sf['Last Name']
----> 2 table1["Seq_nt"]

C:\Anaconda\lib\site-packages\graphlab\data_structures\sframe.pyc in __getitem__(self, key)
   3536             return self._row_selector(key)
   3537         elif type(key) is str:
-> 3538             return self.select_column(key)
   3539         elif type(key) is type:
   3540             return self.select_columns([key])

C:\Anaconda\lib\site-packages\graphlab\data_structures\sframe.pyc in select_column(self, key)
   3141             raise TypeError("Invalid key type: must be str")
   3142         with cython_context():
-> 3143             return SArray(data=[], _proxy=self.__proxy__.select_column(key))
   3144 
   3145     def select_columns(self, keylist):

C:\Anaconda\lib\site-packages\graphlab\cython\context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Column name Seq_nt does not exist.



In [18]:

    
sf









    Out[18]:





    
        First Name
        Last Name
        Country
        age
        Full Name
    
    
        Bob
        Smith
        United States
        24
        Bob Smith
    
    
        Alice
        Williams
        Canada
        23
        Alice Williams
    
    
        Malcolm
        Jone
        England
        22
        Malcolm Jone
    
    
        Felix
        Brown
        USA
        23
        Felix Brown
    
    
        Alex
        Cooper
        Poland
        23
        Alex Cooper
    
    
        Tod
        Campbell
        United States
        22
        Tod Campbell
    
    
        Derek
        Ward
        Switzerland
        25
        Derek Ward
    

[7 rows x 5 columns]



In [20]:

    
sf['age']+2









    Out[20]:





dtype: int
Rows: 7
[26L, 25L, 24L, 25L, 25L, 24L, 27L]



In [21]:

    
sf['age']=sf['age']+2



In [22]:

    
sf









    Out[22]:





    
        First Name
        Last Name
        Country
        age
        Full Name
    
    
        Bob
        Smith
        United States
        26
        Bob Smith
    
    
        Alice
        Williams
        Canada
        25
        Alice Williams
    
    
        Malcolm
        Jone
        England
        24
        Malcolm Jone
    
    
        Felix
        Brown
        USA
        25
        Felix Brown
    
    
        Alex
        Cooper
        Poland
        25
        Alex Cooper
    
    
        Tod
        Campbell
        United States
        24
        Tod Campbell
    
    
        Derek
        Ward
        Switzerland
        27
        Derek Ward
    

[7 rows x 5 columns]

Use the apply function to do an advanced transformation of our data



In [23]:

    
sfpp['Country']









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-23-af42b1b35ea2> in <module>()
----> 1 sfpp['Country']

NameError: name 'sfpp' is not defined



In [24]:

    
sf['Country']









    Out[24]:





dtype: str
Rows: 7
['United States', 'Canada', 'England', 'USA', 'Poland', 'United States', 'Switzerland']



In [25]:

    
sf['Country'].show()



In [26]:

    
def transform_country(country):
    if country == 'USA':
        return 'United States'
    else:
        return country



In [27]:

    
transform_country('Brazil')









    Out[27]:





'Brazil'



In [28]:

    
transform_country('USA')









    Out[28]:





'United States'



In [29]:

    
sf['Country'].apply(transform.country)









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-29-ebb4bbbf4ccf> in <module>()
----> 1 sf['Country'].apply(transform.country)

NameError: name 'transform' is not defined



In [30]:

    
sf['Country'].apply(transform_country)









    Out[30]:





dtype: str
Rows: 7
['United States', 'Canada', 'England', 'United States', 'Poland', 'United States', 'Switzerland']



In [31]:

    
sf['Country']=sf['Country'].apply(transform_country)



In [32]:

    
sf









    Out[32]:





    
        First Name
        Last Name
        Country
        age
        Full Name
    
    
        Bob
        Smith
        United States
        26
        Bob Smith
    
    
        Alice
        Williams
        Canada
        25
        Alice Williams
    
    
        Malcolm
        Jone
        England
        24
        Malcolm Jone
    
    
        Felix
        Brown
        United States
        25
        Felix Brown
    
    
        Alex
        Cooper
        Poland
        25
        Alex Cooper
    
    
        Tod
        Campbell
        United States
        24
        Tod Campbell
    
    
        Derek
        Ward
        Switzerland
        27
        Derek Ward
    

[7 rows x 5 columns]



In [ ]:

contig_id	feature_id	type	location	start	stop	strand
NODE_100_length_4674_cov_ 7.14051_ID_199 ...	fig\|1648.160.peg.1	peg	NODE_100_length_4674_cov_ 7.14051_ID_199_39_530 ...	39	530	+
NODE_100_length_4674_cov_ 7.14051_ID_199 ...	fig\|1648.160.peg.2	peg	NODE_100_length_4674_cov_ 7.14051_ID_199_693_1601 ...	693	1601	+
NODE_100_length_4674_cov_ 7.14051_ID_199 ...	fig\|1648.160.peg.3	peg	NODE_100_length_4674_cov_ 7.14051_ID_199_1594_2829 ...	1594	2829	+
NODE_100_length_4674_cov_ 7.14051_ID_199 ...	fig\|1648.160.peg.4	peg	NODE_100_length_4674_cov_ 7.14051_ID_199_3666_2881 ...	3666	2881	-
NODE_100_length_4674_cov_ 7.14051_ID_199 ...	fig\|1648.160.peg.5	peg	NODE_100_length_4674_cov_ 7.14051_ID_199_4297_3758 ...	4297	3758	-
NODE_101_length_4672_cov_ 9.29831_ID_201 ...	fig\|1648.160.peg.6	peg	NODE_101_length_4672_cov_ 9.29831_ID_201_4524_3769 ...	4524	3769	-
NODE_102_length_4655_cov_ 7.745_ID_203 ...	fig\|1648.160.peg.7	peg	NODE_102_length_4655_cov_ 7.745_ID_203_1744_830 ...	1744	830	-
NODE_102_length_4655_cov_ 7.745_ID_203 ...	fig\|1648.160.peg.8	peg	NODE_102_length_4655_cov_ 7.745_ID_203_2906_1728 ...	2906	1728	-
NODE_102_length_4655_cov_ 7.745_ID_203 ...	fig\|1648.160.peg.9	peg	NODE_102_length_4655_cov_ 7.745_ID_203_4549_2894 ...	4549	2894	-
NODE_103_length_4645_cov_ 7.86383_ID_205 ...	fig\|1648.160.peg.10	peg	NODE_103_length_4645_cov_ 7.86383_ID_205_674_1138 ...	674	1138	+

function	figfam	evidence_codes	nucleotide_sequence
Putative hemagglutinin /hemolysin-related ...			atgactgttaacgaggatgttacac tttatgcacaatggaaaaaat ...
ABC transporter, ATP- binding protein ...	FIG00744535	ff	atgaaattagaatttaatcaaatca ataagttttttgacgaacatc ...
ABC transporter, permease protein (putative) ...			atgaataaactcaaaacggttttta aatttgaatttttagaaatgt ...
L-Aspartate dehydrogenase (EC 1.4.1.21) homolog ...	FIG01308195	ff	atgaaaaaattaaaattagcgttaa ttgggccaggatttctcaacg ...
hypothetical protein			atggaaaatcacgtaattatctatc aaacaaaagagcccaaacttc ...
COG1242: Predicted Fe-S oxidoreductase ...	FIG00001867	isu;An_Fe-S_oxireductase_ coupled_with_a_methyl ...	atgattttaaaaaatccagacttag ataagcaaatcgagcacagtg ...
Site-specific recombinase XerD ...			atgaaaaacaaagctaatgatttta ttcactacctccaatttattg ...
DNA polymerase IV (EC 2.7.7.7) ...	FIG00023943	icw(1);DNA_repair,_bacter ial ...	atggcacaagtaatatttcatatag atattaacgcgttttatgcaa ...
DNA repair protein RecN	FIG00000317	isu;DNA-replication isu;DNA_repair,_bacte ...	atgcttactcatttaacgatagata attttgttttgattcatcata ...
Phage protein			atgtcaaataagagaggtaaatacg taaaatggcttgaacctaaaa ...

aa_sequence	Calculated_length_nt	Gene_GC	Calculated_length_aa
MTVNEDVTLYAQWKKLIVVVDTYTV VYDGNGNTSGNAPIDSHPYEK ...	492	0.355691056911	163
MKLEFNQINKFFDEHHVLKDISFEV NSGQIFGYLGRNGAGKTTSIR ...	909	0.324532453245	302
MNKLKTVFKFEFLEMLRKRSVKVTT LILCIAVLLITSVPTIQSVFM ...	1236	0.330906148867	411
MKKLKLALIGPGFLNDIVAQAWVDG YLPEYELVGVLGRNPIRTAAF ...	786	0.40203562341	261
MENHVIIYQTKEPKLHTLDACIQDF TKRYKLDNTWYLYKTANGKPI ...	540	0.322222222222	179
MILKNPDLDKQIEHSENIMLQKWPN AAKIAYFQAYSNTHDSLKNLK ...	756	0.382275132275	251
MKNKANDFIHYLQFIDPKAALTVAS YRNDLNQYIEYLEHESVSSLE ...	915	0.343169398907	304
MAQVIFHIDINAFYASAHLITDSSL YGKPVVVCSNQRGSVVTTASY ...	1179	0.367260390161	392
MLTHLTIDNFVLIHHISVDFSDRFN VFTGETGAGKSLLVDALNFVS ...	1656	0.362922705314	551
MSNKRGKYVKWLEPKNLLLLESWAR KISNEQIAANIGIAPKTLYAW ...	465	0.406451612903	154

contig_id	feature_id	type	location	start	stop	strand
NODE_11_length_9738_cov_8 .40958_ID_21 ...	fig\|1648.160.rna.1	rna	NODE_11_length_9738_cov_8 .40958_ID_21_9726_9656 ...	9726	9656	-
NODE_155_length_3642_cov_ 7.81071_ID_309 ...	fig\|1648.160.rna.2	rna	NODE_155_length_3642_cov_ 7.81071_ID_309_3325_3395 ...	3325	3395	+
NODE_160_length_3510_cov_ 8.02605_ID_319 ...	fig\|1648.160.rna.3	rna	NODE_160_length_3510_cov_ 8.02605_ID_319_2002_1930 ...	2002	1930	-
NODE_183_length_3178_cov_ 10.617_ID_365 ...	fig\|1648.160.rna.4	rna	NODE_183_length_3178_cov_ 10.617_ID_365_2494_2422 ...	2494	2422	-
NODE_183_length_3178_cov_ 10.617_ID_365 ...	fig\|1648.160.rna.5	rna	NODE_183_length_3178_cov_ 10.617_ID_365_2577_2505 ...	2577	2505	-
NODE_183_length_3178_cov_ 10.617_ID_365 ...	fig\|1648.160.rna.6	rna	NODE_183_length_3178_cov_ 10.617_ID_365_2827_2755 ...	2827	2755	-
NODE_183_length_3178_cov_ 10.617_ID_365 ...	fig\|1648.160.rna.7	rna	NODE_183_length_3178_cov_ 10.617_ID_365_2910_2838 ...	2910	2838	-
NODE_184_length_3174_cov_ 6.24174_ID_367 ...	fig\|1648.160.rna.8	rna	NODE_184_length_3174_cov_ 6.24174_ID_367_3041_2969 ...	3041	2969	-
NODE_197_length_2970_cov_ 11.1692_ID_393 ...	fig\|1648.160.rna.9	rna	NODE_197_length_2970_cov_ 11.1692_ID_393_1043_971 ...	1043	971	-
NODE_197_length_2970_cov_ 11.1692_ID_393 ...	fig\|1648.160.rna.10	rna	NODE_197_length_2970_cov_ 11.1692_ID_393_1143_1070 ...	1143	1070	-

function	nucleotide_sequence	Calculated_length_nt
tRNA-Thr-GGT	gccgccatagttcaatggtaaaata gagcaatggtaatgctcagtt ...	71
tRNA-Trp-CCA	aggggtatagttcaatggtagagca acggtctccaaaaccgttaat ...	71
tRNA-Thr-CGT	gccggcttagctcaattggtagagc aactgaatcgtaatcagtagg ...	73
tRNA-Glu-TTC	ggcccgttggagaaacggttaactc acatgcctttcacgcatgcat ...	73
tRNA-Asn-GTT	gcttgtttagctcagtcggtagagc aactggctgttaaccagtggg ...	73
tRNA-Glu-TTC	ggcccgttggagaaacggttaactc acatgcctttcacgcatgcat ...	73
tRNA-Asn-GTT	gcttgtttagctcagtcggtagagc aactggctgttaaccagtggg ...	73
tRNA-Lys-CTT	gctccgttagctcagtcggtagagc aactgactcttaatcagtggg ...	73
tRNA-Ala-TGC	ggggttttagctcagctgggagagc gcctgccttgcacgcaggagg ...	73
tRNA-Ile-GAT	gggcctgtagctcagttggttagag cgcacccctgataagggtgag ...	74

Gene_GC	Calculated_length_aa
0.492957746479	0
0.507042253521	0
0.520547945205	0
0.547945205479	0
0.547945205479	0
0.547945205479	0
0.547945205479	0
0.575342465753	0
0.602739726027	0
0.581081081081	0

Feature ID	Seq_aa
fig\|1648.160.peg.1398	MKLTDKQLKIIETYGADTPESFLMV FPYRYERLEAKPRDAWRPGDT ...
fig\|1648.160.peg.1399	MTKWDLLDSMKIPYFNKDLINNAFI HSSYVNEAEELLEDNERLEFM ...
fig\|1648.160.peg.1428	MVIHFDLAQKEAKIKELESEMVEDG FWDDHRHSSQHIQKLNQLKKV ...
fig\|1648.160.peg.1429	MAGFLSNLFSGDRKILNEIEKIAHE IDALADETRALSDEALKEKTN ...
fig\|1648.160.peg.1394	MPKLDILNLEGKSLRELELNEAVFG IEPNNQTIFEAVVMQQASLRQ ...
fig\|1648.160.peg.1395	MKGLLGRKLGMTQVFTTDGKLIPVS VVEVLPNVVLQKKTMESDNYE ...
fig\|1648.160.peg.1396	MAKNFIRIRLKAYEHRTIDSAAQKI VQAANDHGAQKVVGPVPLPTE ...
fig\|1648.160.peg.1397	MAAKQAQDICEDKDIHVIETKTIPQ GLSACVMFNPDVDVEDNLTEM ...
fig\|1648.160.peg.1422	MVPYESTVFTNYQTIARLGLLNTYT ALIVPSLASVFYIFYLKGYLT ...
fig\|1648.160.peg.1391	MSRSLKKGPFCDLHLMNKVEKLNAE SKKEVIKTWSRRSTIFPQFVE ...

Feature ID	Seq_nt
fig\|1648.160.peg.1398	atgaaactcacggataaacaattaa aaattattgaaacttatggtg ...
fig\|1648.160.peg.1399	atgacaaaatgggatttattggatt ctatgaagattccatacttta ...
fig\|1648.160.peg.1428	ttggtgattcactttgacttagctc aaaaagaagctaaaatcaaag ...
fig\|1648.160.peg.1429	atggctggatttttaagtaatcttt ttagtggcgatcgtaagattt ...
fig\|1648.160.peg.1394	atgcctaagttagatattttaaacc ttgaaggcaagagtcttcgtg ...
fig\|1648.160.peg.1395	atgaaaggattactaggacgtaaat taggaatgacacaagtcttca ...
fig\|1648.160.peg.1396	atggcaaagaattttattcgtattc gtttaaaagcttatgaacacc ...
fig\|1648.160.peg.1397	atggctgcaaaacaagctcaagata tttgtgaagacaaagatatac ...
fig\|1648.160.peg.1390	atggaagtaaaagctatagttaaaa cagttcgtgttactccacgta ...
fig\|1648.160.peg.1391	atgagtcgtagtttaaaaaaaggac cattttgtgatcttcatttaa ...

Function	Subsystem	Seq_nt	Calculated_length_nt
Putative hemagglutinin /hemolysin-related ...	- none -	atgactgttaacgaggatgttacac tttatgcacaatggaaaaaat ...	492
ABC transporter, ATP- binding protein ...	- none -	atgaaattagaatttaatcaaatca ataagttttttgacgaacatc ...	909
ABC transporter, permease protein (putative) ...	- none -	atgaataaactcaaaacggttttta aatttgaatttttagaaatgt ...	1236
L-Aspartate dehydrogenase (EC 1.4.1.21) homolog ...	- none -	atgaaaaaattaaaattagcgttaa ttgggccaggatttctcaacg ...	786
hypothetical protein	- none -	atggaaaatcacgtaattatctatc aaacaaaagagcccaaacttc ...	540
COG1242: Predicted Fe-S oxidoreductase ...	An Fe-S oxireductase coupled with a ...	atgattttaaaaaatccagacttag ataagcaaatcgagcacagtg ...	756
Site-specific recombinase XerD ...	- none -	atgaaaaacaaagctaatgatttta ttcactacctccaatttattg ...	915
DNA polymerase IV (EC 2.7.7.7) ...	DNA repair, bacterial	atggcacaagtaatatttcatatag atattaacgcgttttatgcaa ...	1179
DNA repair protein RecN	DNA-replication; <br>DNA repair, bacterial ...	atgcttactcatttaacgatagata attttgttttgattcatcata ...	1656
Phage protein	- none -	atgtcaaataagagaggtaaatacg taaaatggcttgaacctaaaa ...	465

First Name	Last Name	Country	age
Bob	Smith	United States	24
Alice	Williams	Canada	23
Malcolm	Jone	England	22
Felix	Brown	USA	23
Alex	Cooper	Poland	23
Tod	Campbell	United States	22
Derek	Ward	Switzerland	25

First Name	Last Name	Country	age	Full Name
Bob	Smith	United States	24	Bob Smith
Alice	Williams	Canada	23	Alice Williams
Malcolm	Jone	England	22	Malcolm Jone
Felix	Brown	USA	23	Felix Brown
Alex	Cooper	Poland	23	Alex Cooper
Tod	Campbell	United States	22	Tod Campbell
Derek	Ward	Switzerland	25	Derek Ward