Introduction to Pygenprop

A Python library for interactive, programmatic use of Genome Properties

InterProScan files used in this tutorial can be found at:


In [83]:
import requests
from io import StringIO
from pygenprop.results import GenomePropertiesResults
from pygenprop.database_file_parser import parse_genome_properties_flat_file
from pygenprop.assignment_file_parser import parse_interproscan_file, parse_genome_property_longform_file

In [84]:
# Genome Properties is a flat-file database hosted on GitHub.
# The latest release of the database can be found at the following URL.

genome_properties_database_url = 'https://raw.githubusercontent.com/ebi-pf-team/genome-properties/master/flatfiles/genomeProperties.txt'

# For this tutorial we stream the file directly into the Jupyter notebook. Alternatively,
# the file could be downloaded with the Unix wget or curl commands.

with requests.Session() as current_download:
    response = current_download.get(genome_properties_database_url, stream=True)
    tree = parse_genome_properties_flat_file(StringIO(response.text))
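
If the notebook is re-run often, keeping a local copy avoids repeated downloads. The following is a minimal sketch of that idea, not part of Pygenprop: `load_cached_text` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the example runs without network access. In practice the fetch function would wrap the `requests` call above.

```python
from pathlib import Path
import tempfile


def load_cached_text(cache_path, fetch):
    """Return the text at cache_path, calling fetch() only if the file is missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    text = fetch()
    cache_path.write_text(text, encoding="utf-8")
    return text


# Hypothetical stand-in for the real download; it records how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return "stand-in database text\n"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "genomeProperties.txt"
    first = load_cached_text(target, fake_fetch)   # fetches and writes the file
    second = load_cached_text(target, fake_fetch)  # reads the cached copy
```

The returned text can be wrapped in `StringIO` exactly as in the cell above, or the cached file can be opened and its handle passed to the parser.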

In [85]:
# There are 1286 properties in the Genome Properties tree.
len(tree)


Out[85]:
1286

In [86]:
# Find all properties of type "GUILD".
for genome_property in tree:
    if genome_property.type == 'GUILD':
        print(genome_property.name)


Coenzyme F420 utilization
CRISPR region
Reduction of oxidized methionine
Phage: major features
Resistance to Reactive Oxygen Species (ROS)
tRNA aminoacylation
Toxin-antitoxin system, type II
Protein-coding palindromic elements
Flagellar components of unknown function
Bacillithiol utilization
Toxin-antitoxin system, type I
Toxin-antitoxin system, type III
Abortive infection proteins
Energy-coupling factor transporters
Initiator caspases of the apoptosis extrinsic pathway
Executor caspases of apoptosis
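
The same attribute-based filtering extends to tallying how many properties fall under each type. The sketch below uses a hypothetical `Property` stand-in that carries only the `name` and `type` attributes seen above, so it runs without the database; with the real tree, the generator expressions would iterate over `tree` instead of `toy_tree`.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a parsed property; only .name and .type are modelled.
Property = namedtuple("Property", ["name", "type"])

toy_tree = [
    Property("CRISPR region", "GUILD"),
    Property("tRNA aminoacylation", "GUILD"),
    Property("Virulence", "CATEGORY"),
]

# Count properties per type, then list the names of one type.
type_counts = Counter(prop.type for prop in toy_tree)
guild_names = sorted(prop.name for prop in toy_tree if prop.type == "GUILD")

print(type_counts)
print(guild_names)
```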

In [87]:
# Parse InterProScan files
with open('E_coli_K12.tsv') as ipr5_file_one:
    assignment_cache_1 = parse_interproscan_file(ipr5_file_one)

In [88]:
with open('E_coli_O157_H7.tsv') as ipr5_file_two:
    assignment_cache_2 = parse_interproscan_file(ipr5_file_two)
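
For orientation, the sketch below shows roughly what information an InterProScan TSV file carries. It is an illustration only and does not reproduce what `parse_interproscan_file` does internally. It assumes the standard InterProScan TSV column order (signature accession in column 5, InterPro accession in column 12) and assumes that a sample label is derived from the file name, which is consistent with the column headers in the outputs further down. The function names and the accessions in the toy rows are made up.

```python
import csv
from io import StringIO
from pathlib import Path


def sample_name(path):
    """Sample label taken from the file name without its extension (an assumption)."""
    return Path(path).stem


def collect_accessions(handle):
    """Collect signature and InterPro accessions from InterProScan-style TSV rows."""
    signatures, interpro = set(), set()
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue  # skip blank or truncated lines
        signatures.add(row[4])
        # The InterPro accession column is optional and may hold "-".
        if len(row) > 11 and row[11] not in ("", "-"):
            interpro.add(row[11])
    return signatures, interpro


# Toy rows with made-up values, laid out in the assumed column order.
toy_tsv = (
    "protein_1\tmd5\t100\tPfam\tPF00001\ttoy\t1\t50\t1e-5\tT\t01-01-2020\tIPR000001\ttoy\n"
    "protein_2\tmd5\t200\tTIGRFAM\tTIGR00001\ttoy\t5\t150\t1e-9\tT\t01-01-2020\n"
)

signatures, interpro = collect_accessions(StringIO(toy_tsv))
label = sample_name("E_coli_K12.tsv")
```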

In [89]:
# Create results comparison object
results = GenomePropertiesResults(assignment_cache_1, assignment_cache_2, properties_tree=tree)

In [90]:
# Get property by identifier
virulence = tree['GenProp0074']

In [91]:
virulence


Out[91]:
GenProp0074, Type: CATEGORY, Name: Virulence, Thresh: 0, References: False, Databases: False, Steps: True, Parents: True, Children: True, Public: False

In [92]:
# Iterate to get the identifiers of child properties of virulence
types_of_vir = [genprop.id for genprop in virulence.children]
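
The list comprehension above collects direct children only. To gather every descendant, the `children` attribute can be followed recursively. The sketch below is one way to do that under stated assumptions: `ToyProperty` is a hypothetical stand-in exposing just `id` and `children`, and a `seen` set guards against visiting a property twice when it is reachable through more than one parent.

```python
class ToyProperty:
    """Hypothetical stand-in exposing only the .id and .children attributes."""

    def __init__(self, identifier, children=()):
        self.id = identifier
        self.children = list(children)


def descendant_ids(prop, seen=None):
    """Return identifiers of all descendants in depth-first order, without repeats."""
    seen = set() if seen is None else seen
    ordered = []
    for child in prop.children:
        if child.id in seen:
            continue
        seen.add(child.id)
        ordered.append(child.id)
        ordered.extend(descendant_ids(child, seen))
    return ordered


grandchild = ToyProperty("grandchild_b")
root = ToyProperty("root", [ToyProperty("child_a"), ToyProperty("child_c", [grandchild])])

direct_ids = [child.id for child in root.children]
all_ids = descendant_ids(root)
```

The resulting identifiers can be unpacked into `results.get_results(...)` in the same way as `types_of_vir` in the cells below.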

In [93]:
# The property_results attribute is a table comparing property assignments across samples.
results.property_results


Out[93]:
E_coli_K12 E_coli_O157_H7
Property_Identifier
GenProp0724 NO NO
GenProp0757 YES YES
GenProp0809 NO NO
GenProp0853 NO NO
GenProp0861 NO NO
GenProp0901 NO NO
GenProp0919 NO NO
GenProp0920 NO NO
GenProp0921 NO NO
GenProp0936 NO NO
GenProp0945 NO NO
GenProp0954 NO NO
GenProp0955 NO NO
GenProp0956 NO NO
GenProp0962 NO NO
GenProp0967 NO NO
GenProp0982 NO NO
GenProp0984 NO NO
GenProp0991 NO NO
GenProp1000 NO NO
GenProp1002 NO NO
GenProp1003 NO NO
GenProp1037 NO NO
GenProp1052 NO NO
GenProp1062 NO NO
GenProp1065 NO NO
GenProp1078 NO NO
GenProp1083 NO NO
GenProp1084 NO NO
GenProp1090 NO NO
... ... ...
GenProp0318 NO NO
GenProp0319 NO NO
GenProp0320 PARTIAL PARTIAL
GenProp0469 PARTIAL PARTIAL
GenProp0670 NO NO
GenProp0685 NO NO
GenProp0768 PARTIAL PARTIAL
GenProp0922 PARTIAL PARTIAL
GenProp1061 NO NO
GenProp1106 PARTIAL PARTIAL
GenProp1093 NO NO
GenProp2007 NO NO
GenProp1067 PARTIAL PARTIAL
GenProp2085 NO NO
GenProp2038 NO NO
GenProp2086 NO NO
GenProp2087 NO NO
GenProp2088 NO NO
GenProp2089 NO NO
GenProp2090 NO NO
GenProp2092 NO NO
GenProp2093 NO NO
GenProp2094 NO NO
GenProp2097 NO NO
GenProp2095 NO NO
GenProp2096 NO NO
GenProp2098 NO NO
GenProp2099 NO NO
GenProp1778 NO NO
GenProp0065 PARTIAL PARTIAL

1286 rows × 2 columns


In [94]:
# The step_results attribute is a table comparing step assignments across samples.
results.step_results


Out[94]:
E_coli_K12 E_coli_O157_H7
Property_Identifier Step_Number
GenProp0724 1 NO NO
2 NO NO
3 NO NO
4 YES YES
5 YES YES
6 NO NO
7 YES YES
8 NO NO
GenProp0077 2 NO NO
3 YES YES
4 NO NO
5 NO NO
6 NO NO
7 NO NO
8 NO NO
9 NO NO
10 NO NO
11 NO NO
12 NO NO
13 NO NO
14 NO NO
15 NO NO
16 NO NO
17 NO NO
18 NO NO
19 NO NO
20 NO NO
21 NO NO
22 NO NO
23 NO NO
... ... ... ...
GenProp2095 6 NO NO
7 NO NO
8 NO NO
9 NO NO
GenProp2097 1 NO NO
2 NO NO
3 NO NO
4 NO NO
GenProp2096 1 NO NO
2 NO NO
4 NO NO
5 NO NO
6 NO NO
GenProp2098 1 NO NO
2 NO NO
4 NO NO
5 NO NO
6 NO NO
7 NO NO
GenProp2099 1 NO NO
2 NO NO
3 YES YES
4 NO NO
5 NO NO
6 NO NO
7 NO NO
8 NO NO
9 NO NO
10 NO NO
11 NO NO

6525 rows × 2 columns


In [95]:
# Get properties with differing assignments
results.differing_property_results


Out[95]:
E_coli_K12 E_coli_O157_H7
Property_Identifier
GenProp0111 YES PARTIAL
GenProp1032 YES PARTIAL
GenProp0183 YES PARTIAL
GenProp1695 NO PARTIAL
GenProp1331 PARTIAL YES
GenProp1388 NO YES
GenProp0051 NO YES
GenProp0232 PARTIAL YES
GenProp0236 PARTIAL YES
GenProp0455 YES PARTIAL
GenProp0687 YES PARTIAL
GenProp0686 YES PARTIAL
GenProp0139 NO PARTIAL
GenProp1365 YES PARTIAL
GenProp0283 YES NO
GenProp1297 YES PARTIAL
GenProp1326 YES PARTIAL
GenProp1501 PARTIAL NO
GenProp1402 YES PARTIAL
GenProp1299 YES PARTIAL
GenProp1463 YES PARTIAL
GenProp1568 YES NO
GenProp1374 YES PARTIAL
GenProp1556 YES PARTIAL
GenProp1566 YES PARTIAL
GenProp0938 YES PARTIAL
GenProp0052 NO PARTIAL
GenProp0059 NO YES
GenProp0735 NO YES
GenProp1074 NO YES
GenProp0961 NO YES
GenProp0820 PARTIAL YES
GenProp0323 PARTIAL YES
GenProp1120 YES NO
GenProp1133 YES NO
GenProp1189 YES NO
GenProp0176 YES NO
GenProp1094 NO PARTIAL
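
The selection rule behind this table can be stated simply: keep a row when its values are not all the same across samples. The sketch below illustrates that rule with plain dictionaries and three rows copied from the outputs above. It is not the library's implementation, and the variable names are illustrative.

```python
# Three rows taken from the outputs above, keyed by property identifier.
assignments = {
    "GenProp0724": {"E_coli_K12": "NO", "E_coli_O157_H7": "NO"},
    "GenProp0757": {"E_coli_K12": "YES", "E_coli_O157_H7": "YES"},
    "GenProp0052": {"E_coli_K12": "NO", "E_coli_O157_H7": "PARTIAL"},
}

# A row differs when its samples hold more than one distinct assignment.
differing = {
    identifier: row
    for identifier, row in assignments.items()
    if len(set(row.values())) > 1
}

print(differing)
```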

In [96]:
# Get property assignments for virulence properties
results.get_results(*types_of_vir, steps=False)


Out[96]:
                     E_coli_K12  E_coli_O157_H7
Property_Identifier
GenProp0052                  NO         PARTIAL
GenProp0648                 YES             YES
GenProp0707                  NO              NO

In [97]:
# Get step assignments for virulence properties
results.get_results(*types_of_vir, steps=True)


Out[97]:
E_coli_K12 E_coli_O157_H7
Property_Identifier Step_Number
GenProp0052 1 NO NO
2 NO NO
3 NO NO
4 NO NO
5 NO NO
6 NO YES
7 NO NO
8 NO YES
9 NO NO
10 YES YES
11 NO NO
12 NO NO
13 NO YES
14 NO YES
15 NO NO
16 NO YES
17 NO NO
18 NO NO
19 NO YES
21 NO NO
22 YES YES
24 NO YES
25 NO NO
26 NO YES
27 NO YES
28 NO YES
29 NO NO
30 NO NO
31 NO YES
32 NO YES
33 NO NO
34 NO NO
35 NO YES
36 NO NO
37 NO YES
38 NO YES
39 NO YES
40 NO YES
41 NO YES
42 NO YES
43 NO NO
44 NO NO
GenProp0648 1 YES YES
2 YES YES
3 YES YES
4 YES YES
5 YES YES
6 YES YES
7 YES YES
GenProp0707 1 NO NO
2 NO NO
3 NO NO
4 NO NO
5 NO NO
6 NO NO

In [98]:
# Get counts of virulence properties assigned YES, NO, and PARTIAL per organism
results.get_results_summary(*types_of_vir, steps=False, normalize=False)


Out[98]:
         E_coli_K12  E_coli_O157_H7
NO              2.0               1
PARTIAL         0.0               1
YES             1.0               1

In [99]:
# Get counts of virulence steps per assignment and organism (only YES and NO occur in this output)
results.get_results_summary(*types_of_vir, steps=True, normalize=False)


Out[99]:
     E_coli_K12  E_coli_O157_H7
NO           46              27
YES           9              28

In [100]:
# Get percentages of virulence steps per assignment and organism (only YES and NO occur in this output)
results.get_results_summary(*types_of_vir, steps=True, normalize=True)


Out[100]:
     E_coli_K12  E_coli_O157_H7
NO    83.636364       49.090909
YES   16.363636       50.909091
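
The percentages follow directly from the counts in Out[99]: each organism has 55 virulence steps, so 46 NO assignments correspond to 46 / 55 × 100 ≈ 83.64 %. The sketch below reproduces that arithmetic with the standard library. The `summarize` helper is hypothetical and only mirrors the effect of the `normalize` flag as seen in the outputs. It is not the library's code.

```python
from collections import Counter


def summarize(assignments, normalize=False):
    """Count assignments per label, optionally as percentages of the total."""
    counts = Counter(assignments)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}


# Step assignments rebuilt from the counts shown in Out[99].
k12_steps = ["NO"] * 46 + ["YES"] * 9
o157_steps = ["NO"] * 27 + ["YES"] * 28

k12_counts = summarize(k12_steps)
k12_percent = summarize(k12_steps, normalize=True)
o157_percent = summarize(o157_steps, normalize=True)
```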

In [ ]: