In [1]:

    
import sys
sys.version_info









    Out[1]:





sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)

petl Case Study 1 - Comparing Tables

This case study illustrates the use of the petl package for doing some simple profiling and comparison of data from two tables.

Introduction

The files used in this case study can be downloaded from the following link:

http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip

Download and unzip the files:



In [2]:

    
!wget http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
!unzip -o petl-case-study-1-files.zip









    



--2015-01-19 17:37:39--  http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
Resolving aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)... 54.231.9.241
Connecting to aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)|54.231.9.241|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3076773 (2.9M) [application/zip]
Saving to: ‘petl-case-study-1-files.zip’

100%[======================================>] 3,076,773   2.34MB/s   in 1.3s   

2015-01-19 17:37:41 (2.34 MB/s) - ‘petl-case-study-1-files.zip’ saved [3076773/3076773]

Archive:  petl-case-study-1-files.zip
  inflating: popdata.csv             
  inflating: snpdata.csv

The first file (snpdata.csv) contains a list of locations in the genome of the malaria parasite P. falciparum, along with some basic data about genetic variations found at those locations.

The second file (popdata.csv) is supposed to contain the same list of genome locations, along with some additional data such as allele frequencies in different populations.

The main point for this case study is that the first file (snpdata.csv) contains the canonical list of genome locations, and the second file (popdata.csv) contains some additional data about the same genome locations and therefore should be consistent with the first file. We want to check whether this second file is in fact consistent with the first file.

Preparing the data

Start by importing the petl package:



In [3]:

    
import petl as etl
etl.__version__









    Out[3]:





'1.0.0'

To save some typing, let a be the table of data extracted from the first file (snpdata.csv), and let b be the table of data extracted from the second file (popdata.csv), using the fromcsv() function:



In [4]:

    
a = etl.fromtsv('snpdata.csv')
b = etl.fromtsv('popdata.csv')

Examine the header from each file:



In [5]:

    
a.header()









    Out[5]:





('Chr',
 'Pos',
 'Ref',
 'Nref',
 'Der',
 'Mut',
 'isTypable',
 'GeneId',
 'GeneAlias',
 'GeneDescr')



In [6]:

    
b.header()









    Out[6]:





('Chromosome',
 'Coordinates',
 'Ref. Allele',
 'Non-Ref. Allele',
 'Outgroup Allele',
 'Ancestral Allele',
 'Derived Allele',
 'Ref. Aminoacid',
 'Non-Ref. Aminoacid',
 'Private Allele',
 'Private population',
 'maf AFR',
 'maf PNG',
 'maf SEA',
 'daf AFR',
 'daf PNG',
 'daf SEA',
 'nraf AFR',
 'nraf PNG',
 'nraf SEA',
 'Mutation type',
 'Gene',
 'Gene Aliases',
 'Gene Description',
 'Gene Information')

There is a common set of 9 fields that is present in both tables, and we would like focus on comparing these common fields, however different field names have been used in the two files. To simplify comparison, use rename() to rename some fields in the second file:



In [7]:

    
b_renamed = b.rename({'Chromosome': 'Chr', 
                      'Coordinates': 'Pos', 
                      'Ref. Allele': 'Ref', 
                      'Non-Ref. Allele': 'Nref', 
                      'Derived Allele': 'Der', 
                      'Mutation type': 'Mut', 
                      'Gene': 'GeneId', 
                      'Gene Aliases': 'GeneAlias', 
                      'Gene Description': 'GeneDescr'})
b_renamed.header()









    Out[7]:





('Chr',
 'Pos',
 'Ref',
 'Nref',
 'Outgroup Allele',
 'Ancestral Allele',
 'Der',
 'Ref. Aminoacid',
 'Non-Ref. Aminoacid',
 'Private Allele',
 'Private population',
 'maf AFR',
 'maf PNG',
 'maf SEA',
 'daf AFR',
 'daf PNG',
 'daf SEA',
 'nraf AFR',
 'nraf PNG',
 'nraf SEA',
 'Mut',
 'GeneId',
 'GeneAlias',
 'GeneDescr',
 'Gene Information')

Use cut() to extract only the fields we're interested in from both tables:



In [8]:

    
common_fields = ['Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut', 'GeneId', 'GeneAlias', 'GeneDescr']
a_common = a.cut(common_fields)
b_common = b_renamed.cut(common_fields)

Inspect the data:



In [9]:

    
a_common









    Out[9]:







Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




MAL1
91099
G
A
-
S
PFA0095c
MAL1P1.10
rifin


MAL1
91104
A
T
-
N
PFA0095c
MAL1P1.10
rifin


MAL1
93363
T
A
-
N
PFA0100c
MAL1P1.11
hypothetical protein, conserved in P. falciparum


MAL1
93382
T
G
-
N
PFA0100c
MAL1P1.11
hypothetical protein, conserved in P. falciparum


MAL1
93384
G
A
-
N
PFA0100c
MAL1P1.11
hypothetical protein, conserved in P. falciparum



...



In [10]:

    
b_common









    Out[10]:







Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




MAL1
91099
G
A
-
SYN
PFA0095c
MAL1P1.10,RIF
rifin


MAL1
91104
A
T
-
NON
PFA0095c
MAL1P1.10,RIF
rifin


MAL1
93363
T
A
-
NON
PFA0100c
MAL1P1.11
Plasmodium exported protein (PHISTa), unknown function


MAL1
93382
T
G
-
NON
PFA0100c
MAL1P1.11
Plasmodium exported protein (PHISTa), unknown function


MAL1
93384
G
A
-
NON
PFA0100c
MAL1P1.11
Plasmodium exported protein (PHISTa), unknown function



...

The fromucsv() function does not attempt to parse any of the values from the underlying CSV file, so all values are reported as strings:



In [11]:

    
b_common.display(vrepr=repr)









    







Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




'MAL1'
'91099'
'G'
'A'
'-'
'SYN'
'PFA0095c'
'MAL1P1.10,RIF'
'rifin'


'MAL1'
'91104'
'A'
'T'
'-'
'NON'
'PFA0095c'
'MAL1P1.10,RIF'
'rifin'


'MAL1'
'93363'
'T'
'A'
'-'
'NON'
'PFA0100c'
'MAL1P1.11'
'Plasmodium exported protein (PHISTa), unknown function'


'MAL1'
'93382'
'T'
'G'
'-'
'NON'
'PFA0100c'
'MAL1P1.11'
'Plasmodium exported protein (PHISTa), unknown function'


'MAL1'
'93384'
'G'
'A'
'-'
'NON'
'PFA0100c'
'MAL1P1.11'
'Plasmodium exported protein (PHISTa), unknown function'



...

However, the 'Pos' field should be interpreted as an integer.

Also, the 'Mut' field has a different representation in the two tables, which needs to be converted before the data can be compared:



In [12]:

    
a_common.valuecounts('Mut')









    Out[12]:







Mut
count
frequency




N
71162
0.6865804123611875


S
31535
0.30425386166507473


-
950
0.009165725973737783



In [13]:

    
b_common.valuecounts('Mut')









    Out[13]:







Mut
count
frequency




NON
70880
0.6840510336042


SYN
32738
0.31594896639579995

Use the convert() function to convert the type of the 'Pos' field in both tables and the representation of the 'Mut' field in table b:



In [14]:

    
a_conv = a_common.convert('Pos', int)
b_conv = (
    b_common
    .convert('Pos', int)
    .convert('Mut', {'SYN': 'S', 'NON': 'N'})
)



In [15]:

    
highlight = 'background-color: yellow'
a_conv.display(caption='a', vrepr=repr, td_styles={'Pos': highlight})









    





a


Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




'MAL1'
91099
'G'
'A'
'-'
'S'
'PFA0095c'
'MAL1P1.10'
'rifin'


'MAL1'
91104
'A'
'T'
'-'
'N'
'PFA0095c'
'MAL1P1.10'
'rifin'


'MAL1'
93363
'T'
'A'
'-'
'N'
'PFA0100c'
'MAL1P1.11'
'hypothetical protein, conserved in P. falciparum'


'MAL1'
93382
'T'
'G'
'-'
'N'
'PFA0100c'
'MAL1P1.11'
'hypothetical protein, conserved in P. falciparum'


'MAL1'
93384
'G'
'A'
'-'
'N'
'PFA0100c'
'MAL1P1.11'
'hypothetical protein, conserved in P. falciparum'



...



In [16]:

    
b_conv.display(caption='b', vrepr=repr, td_styles={'Pos': highlight, 'Mut': highlight})









    





b


Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




'MAL1'
91099
'G'
'A'
'-'
'S'
'PFA0095c'
'MAL1P1.10,RIF'
'rifin'


'MAL1'
91104
'A'
'T'
'-'
'N'
'PFA0095c'
'MAL1P1.10,RIF'
'rifin'


'MAL1'
93363
'T'
'A'
'-'
'N'
'PFA0100c'
'MAL1P1.11'
'Plasmodium exported protein (PHISTa), unknown function'


'MAL1'
93382
'T'
'G'
'-'
'N'
'PFA0100c'
'MAL1P1.11'
'Plasmodium exported protein (PHISTa), unknown function'


'MAL1'
93384
'G'
'A'
'-'
'N'
'PFA0100c'
'MAL1P1.11'
'Plasmodium exported protein (PHISTa), unknown function'



...

Now the tables are ready for comparison.

Looking for missing or unexpected rows

Because both tables should contain the same list of genome locations, they should have the same number of rows. Use nrows() to compare:



In [17]:

    
a_conv.nrows()









    Out[17]:





103647



In [18]:

    
b_conv.nrows()









    Out[18]:





103618

There is some discrepancy. First investigate by comparing just the genomic locations, defined by the 'Chr' and 'Pos' fields, using complement():



In [19]:

    
a_locs = a_conv.cut('Chr', 'Pos')
b_locs = b_conv.cut('Chr', 'Pos')
locs_only_in_a = a_locs.complement(b_locs)
locs_only_in_a.nrows()









    Out[19]:





29



In [20]:

    
locs_only_in_a.displayall(caption='a only')









    





a only


Chr
Pos




MAL1
216961


MAL10
538210


MAL10
548779


MAL10
1432969


MAL11
500289


MAL11
1119809


MAL11
1278859


MAL12
51827


MAL13
183727


MAL13
398404


MAL13
627342


MAL13
1216664


MAL13
2750149


MAL14
1991758


MAL14
2297918


MAL14
2372268


MAL14
2994810


MAL2
38577


MAL2
64017


MAL4
1094258


MAL5
1335335


MAL5
1338718


MAL7
670602


MAL7
690509


MAL8
489937


MAL9
416116


MAL9
868677


MAL9
1201970


MAL9
1475245



In [21]:

    
locs_only_in_b = b_locs.complement(a_locs)
locs_only_in_b.nrows()









    Out[21]:





0

So it appears that 29 locations are missing from table b. Export these missing locations to a CSV file using toucsv():



In [22]:

    
locs_only_in_a.tocsv('missing_locations.csv')

An alternative method for finding rows in one table where some key value is not present in another table is to use the antijoin() function:



In [23]:

    
locs_only_in_a = a_conv.antijoin(b_conv, key=('Chr', 'Pos'))
locs_only_in_a.nrows()









    Out[23]:





29

Finding conflicts

We'd also like to compare the values given in the other fields, to find any discrepancies between the two tables.

The simplest way to find conflicts is to merge() both tables under a given key:



In [24]:

    
ab_merge = etl.merge(a_conv, b_conv, key=('Chr', 'Pos'))
ab_merge.display(caption='ab_merge', 
                 td_styles=lambda v: highlight if isinstance(v, etl.Conflict) else '')









    





ab_merge


Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




MAL1
91099
G
A
-
S
PFA0095c
Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'})
rifin


MAL1
91104
A
T
-
N
PFA0095c
Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'})
rifin


MAL1
93363
T
A
-
N
PFA0100c
MAL1P1.11
Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})


MAL1
93382
T
G
-
N
PFA0100c
MAL1P1.11
Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})


MAL1
93384
G
A
-
N
PFA0100c
MAL1P1.11
Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})



...

From a glance at the conflicts above, it appears there are discrepancies in the 'GeneAlias' and 'GeneDescr' fields. There may also be conflicts in other fields, so we need to investigate further.

Note that the table ab_merge will contain all rows, not only those containing conflicts. To find only conflicting rows, use cat() then conflicts(), e.g.:



In [25]:

    
ab = etl.cat(a_conv.addfield('source', 'a', index=0), 
             b_conv.addfield('source', 'b', index=0))
ab_conflicts = ab.conflicts(key=('Chr', 'Pos'), exclude='source')
ab_conflicts.display(10)









    







source
Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




a
MAL1
91099
G
A
-
S
PFA0095c
MAL1P1.10
rifin


b
MAL1
91099
G
A
-
S
PFA0095c
MAL1P1.10,RIF
rifin


a
MAL1
91104
A
T
-
N
PFA0095c
MAL1P1.10
rifin


b
MAL1
91104
A
T
-
N
PFA0095c
MAL1P1.10,RIF
rifin


a
MAL1
93363
T
A
-
N
PFA0100c
MAL1P1.11
hypothetical protein, conserved in P. falciparum


b
MAL1
93363
T
A
-
N
PFA0100c
MAL1P1.11
Plasmodium exported protein (PHISTa), unknown function


a
MAL1
93382
T
G
-
N
PFA0100c
MAL1P1.11
hypothetical protein, conserved in P. falciparum


b
MAL1
93382
T
G
-
N
PFA0100c
MAL1P1.11
Plasmodium exported protein (PHISTa), unknown function


a
MAL1
93384
G
A
-
N
PFA0100c
MAL1P1.11
hypothetical protein, conserved in P. falciparum


b
MAL1
93384
G
A
-
N
PFA0100c
MAL1P1.11
Plasmodium exported protein (PHISTa), unknown function



...

Finally, let's find conflicts in a specific field:



In [26]:

    
ab_conflicts_mut = ab.conflicts(key=('Chr', 'Pos'), include='Mut')
ab_conflicts_mut.display(10, caption='Mut conflicts', td_styles={'Mut': highlight})









    





Mut conflicts


source
Chr
Pos
Ref
Nref
Der
Mut
GeneId
GeneAlias
GeneDescr




a
MAL1
99099
G
T
-
-
PFA0110w
MAL1P1.13,Pf155
ring-infected erythrocyte surface antigen


b
MAL1
99099
G
T
-
N
PFA0110w
MAL1P1.13,Pf155,RESA
ring-infected erythrocyte surface antigen


a
MAL1
99211
C
T
-
-
PFA0110w
MAL1P1.13,Pf155
ring-infected erythrocyte surface antigen


b
MAL1
99211
C
T
-
N
PFA0110w
MAL1P1.13,Pf155,RESA
ring-infected erythrocyte surface antigen


a
MAL1
197903
C
A
A
S
PFA0220w
MAL1P1.34b
ubiquitin carboxyl-terminal hydrolase, putative


b
MAL1
197903
C
A
A
N
PFA0220w
PFA0215w,MAL1P1.34b
ubiquitin carboxyl-terminal hydrolase, putative


a
MAL1
384429
C
T
-
N
PFA0485w
MAL1P2.26
dolichol kinase


b
MAL1
384429
C
T
-
S
-
-
-


a
MAL1
513268
A
G
-
N
PFA0650w
MAL1P3.12,MAL1P3.12a,PFA0655w
surface-associated interspersed gene pseudogene, (SURFIN) pseudogene


b
MAL1
513268
A
G
-
S
PFA0650w
MAL1P3.12,PFA0655,MAL1P3.12a,3D7surf1.2,PFA0655w,MAL1P12a
surface-associated interspersed gene (SURFIN), pseudogene



...



In [27]:

    
ab_conflicts_mut.nrows()









    Out[27]:





3592

For more information about the petl package see the petl online documentation.

Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
MAL1	91099	G	A	-	S	PFA0095c	MAL1P1.10	rifin
MAL1	91104	A	T	-	N	PFA0095c	MAL1P1.10	rifin
MAL1	93363	T	A	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
MAL1	93382	T	G	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum
MAL1	93384	G	A	-	N	PFA0100c	MAL1P1.11	hypothetical protein, conserved in P. falciparum

Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
'MAL1'	'91099'	'G'	'A'	'-'	'SYN'	'PFA0095c'	'MAL1P1.10,RIF'	'rifin'
'MAL1'	'91104'	'A'	'T'	'-'	'NON'	'PFA0095c'	'MAL1P1.10,RIF'	'rifin'
'MAL1'	'93363'	'T'	'A'	'-'	'NON'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'
'MAL1'	'93382'	'T'	'G'	'-'	'NON'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'
'MAL1'	'93384'	'G'	'A'	'-'	'NON'	'PFA0100c'	'MAL1P1.11'	'Plasmodium exported protein (PHISTa), unknown function'

Mut	count	frequency
N	71162	0.6865804123611875
S	31535	0.30425386166507473
-	950	0.009165725973737783

Chr	Pos
MAL1	216961
MAL10	538210
MAL10	548779
MAL10	1432969
MAL11	500289
MAL11	1119809
MAL11	1278859
MAL12	51827
MAL13	183727
MAL13	398404
MAL13	627342
MAL13	1216664
MAL13	2750149
MAL14	1991758
MAL14	2297918
MAL14	2372268
MAL14	2994810
MAL2	38577
MAL2	64017
MAL4	1094258
MAL5	1335335
MAL5	1338718
MAL7	670602
MAL7	690509
MAL8	489937
MAL9	416116
MAL9	868677
MAL9	1201970
MAL9	1475245

source	Chr	Pos	Ref	Nref	Der	Mut	GeneId	GeneAlias	GeneDescr
a	MAL1	99099	G	T	-	-	PFA0110w	MAL1P1.13,Pf155	ring-infected erythrocyte surface antigen
b	MAL1	99099	G	T	-	N	PFA0110w	MAL1P1.13,Pf155,RESA	ring-infected erythrocyte surface antigen
a	MAL1	99211	C	T	-	-	PFA0110w	MAL1P1.13,Pf155	ring-infected erythrocyte surface antigen
b	MAL1	99211	C	T	-	N	PFA0110w	MAL1P1.13,Pf155,RESA	ring-infected erythrocyte surface antigen
a	MAL1	197903	C	A	A	S	PFA0220w	MAL1P1.34b	ubiquitin carboxyl-terminal hydrolase, putative
b	MAL1	197903	C	A	A	N	PFA0220w	PFA0215w,MAL1P1.34b	ubiquitin carboxyl-terminal hydrolase, putative
a	MAL1	384429	C	T	-	N	PFA0485w	MAL1P2.26	dolichol kinase
b	MAL1	384429	C	T	-	S	-	-	-
a	MAL1	513268	A	G	-	N	PFA0650w	MAL1P3.12,MAL1P3.12a,PFA0655w	surface-associated interspersed gene pseudogene, (SURFIN) pseudogene
b	MAL1	513268	A	G	-	S	PFA0650w	MAL1P3.12,PFA0655,MAL1P3.12a,3D7surf1.2,PFA0655w,MAL1P12a	surface-associated interspersed gene (SURFIN), pseudogene