pandasVCF

This example notebook demonstrates basic usage of pandasVCFmulti, a module for parsing multi-sample VCF files with the pandas library. pandasVCFmulti also handles single-sample VCF files.

Libraries



In [1]:

    
#Import pandasvcf package
%matplotlib inline
%pylab inline
import sys
sys.path.append( '../src/' )
from pandasvcf import *
%config InlineBackend.figure_format = 'retina'
pd.options.mode.chained_assignment = None #suppressing the chained assignment warnings

Populating the interactive namespace from numpy and matplotlib

Example File Path



In [2]:

    
vcf_path = '../test_data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v4.20130502.genotypes_10k.vcf.gz'

Creating Vcf object

Initialize a Vcf object by specifying the sample_id string and the columns to include for parsing.

Only the CHROM, POS, REF, ALT, and FORMAT fields are required.

Some VCF files are too large to fit in memory, so the user can specify a chunksize, which allows iterating through the VCF one chunk of rows at a time.



In [3]:

    
vcf_chunk = VCF(vcf_path, sample_id='all', cols=['#CHROM', 'POS', 'REF', 'ALT', 'FORMAT', 'INFO', 'FILTER'], \
                chunksize=1000, n_cores=20)
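
The chunked reading idea can be illustrated with plain pandas. This is a minimal sketch, not the pandasVCF implementation: it assumes Python 3, uses a tiny made-up VCF body (the sample names S1 and S2 are hypothetical), and relies only on the `chunksize` argument of `pd.read_csv`, which returns an iterator of dataframes instead of one large frame.

```python
import io
import pandas as pd

# Tiny made-up VCF body; the '##' meta lines of a real file are omitted.
vcf_text = (
    "#CHROM\tPOS\tREF\tALT\tFORMAT\tS1\tS2\n"
    "22\t100\tA\tG\tGT\t0|0\t0|1\n"
    "22\t200\tG\tA\tGT\t0|1\t0|0\n"
    "22\t300\tC\tT\tGT\t1|1\t0|0\n"
)

# With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
reader = pd.read_csv(io.StringIO(vcf_text), sep="\t", chunksize=2)

chunk_sizes = []
for chunk in reader:
    # Only one chunk is held in memory at a time.
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [2, 1]
```

Three records read two at a time give one full chunk and one partial chunk; the iterator simply ends when the input is exhausted.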



In [4]:

    
%time vcf_chunk.get_vcf_df_chunk()

CPU times: user 38.7 s, sys: 528 ms, total: 39.2 s
Wall time: 38.9 s






    Out[4]:





0



In [5]:

    
vcf_chunk.df.info()
print 
print vcf_chunk.df.shape[1] * vcf_chunk.df.shape[0], 'Genotypes read'

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1000 entries, (22, 16050075, A, G) to (22, 16139996, G, T)
Columns: 2511 entries, CHROM to NA21144
dtypes: int64(1), object(2510)
memory usage: 19.2+ MB

2511000 Genotypes read



In [6]:

    
vcf_chunk.df.head()

    Out[6]:

                           CHROM  POS       REF  ALT  FILTER  INFO                                 FORMAT  HG00096  HG00097  HG00099  ...  NA21128  NA21129  NA21130  NA21133  NA21135  NA21137  NA21141  NA21142  NA21143  NA21144
CHROM  POS       REF  ALT
22     16050075  A    G    22     16050075  A    G    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504  GT      0|0      0|0      0|0      ...  0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0
       16050115  G    A    22     16050115  G    A    PASS    AC=32;AF=0.00638978;AN=5008;NS=2504  GT      0|0      0|0      0|0      ...  0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0
       16050213  C    T    22     16050213  C    T    PASS    AC=38;AF=0.00758786;AN=5008;NS=2504  GT      0|0      0|0      0|0      ...  0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0
       16050319  C    T    22     16050319  C    T    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504  GT      0|0      0|0      0|0      ...  0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0
       16050527  C    A    22     16050527  C    A    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504  GT      0|0      0|0      0|0      ...  0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0      0|0

5 rows × 2511 columns



In [7]:

    
#checking stopIteration flag
vcf_chunk.stopIteration

    Out[7]:





False

Adding Annotations



In [8]:

    
%time vcf_chunk.add_variant_annotations(inplace=True)  #split_columns={'AD':2, 'HQ':2},

CPU times: user 963 ms, sys: 141 ms, total: 1.1 s
Wall time: 2.47 s






    Out[8]:





0



In [9]:

    
vcf_chunk.df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 101737 entries, (22, 16050075, A, G) to (22, 16139996, G, T)
Data columns (total 15 columns):
sample_ids        101737 non-null object
multiallele       101737 non-null int64
phase             101737 non-null object
GT1               101737 non-null int64
GT2               101737 non-null int64
a1                101737 non-null object
a2                101737 non-null object
zygosity          101737 non-null object
vartype1          101737 non-null object
vartype2          101737 non-null object
GT                101737 non-null object
FORMAT            101737 non-null object
hom_ref_counts    101737 non-null float64
INFO              101737 non-null object
FILTER            101737 non-null object
dtypes: float64(1), int64(3), object(11)
memory usage: 12.1+ MB
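
The phase, GT1, GT2, a1 and a2 columns listed above appear to be derived from the GT string together with REF and ALT. The following is a minimal sketch of that relationship, inferred from the column names and values in this notebook rather than taken from the pandasVCF source. `parse_genotype` is a hypothetical helper, and it assumes diploid calls with integer allele indices.

```python
import re

def parse_genotype(gt, ref, alt):
    """Split a diploid GT string such as '0|1' into phase, indices and bases."""
    match = re.match(r'^(\d+)([|/])(\d+)$', gt)
    if match is None:
        # Missing or non-diploid calls (for example './.') are not handled here.
        raise ValueError('unsupported genotype: %r' % gt)
    gt1, phase, gt2 = int(match.group(1)), match.group(2), int(match.group(3))
    # Allele index 0 is REF; indices 1 and up refer to the ALT alleles in order.
    alleles = [ref] + alt.split(',')
    return {'phase': phase, 'GT1': gt1, 'GT2': gt2,
            'a1': alleles[gt1], 'a2': alleles[gt2]}

call = parse_genotype('1|0', ref='G', alt='A')
print(call['a1'], call['a2'])  # A G
```

The `|` separator marks a phased call and `/` an unphased one, so keeping it as its own field preserves that information after the indices are split out.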

Unstacking the parsed dataframe by sample leads to sparsity due to rare variants



In [10]:

    
#unstack dataframe by sample - QUITE SPARSE DUE TO RARE VARIANTS
vcf_chunk.df.set_index('sample_ids', append=True).unstack(level=4).tail()

    Out[10]:

                           multiallele                                                                               ...  FILTER
sample_ids                 HG00096  HG00097  HG00099  HG00100  HG00101  HG00102  HG00103  HG00105  HG00106  HG00107  ...  NA21128  NA21129  NA21130  NA21133  NA21135  NA21137  NA21141  NA21142  NA21143  NA21144
CHROM  POS       REF  ALT
22     16139873  C    T    NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      ...  NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
       16139876  C    T    NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      ...  NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
       16139887  A    T    NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      ...  NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
       16139971  A    G    NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      ...  NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
       16139996  G    T    NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      ...  NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN

5 rows × 35056 columns
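
The sparsity follows from the shape of the data: the annotated dataframe is in long format, with one row per sample call, while unstacking creates a cell for every variant-by-sample pair whether or not a row exists for it. A toy illustration, assuming made-up variants and hypothetical sample names S1 to S3:

```python
import pandas as pd

# Long format: one row per (variant, sample) call that is present.
long_df = pd.DataFrame({
    'CHROM': ['22', '22', '22'],
    'POS': [100, 200, 200],
    'REF': ['A', 'G', 'G'],
    'ALT': ['G', 'A', 'A'],
    'sample_ids': ['S1', 'S2', 'S3'],
    'GT': ['0|1', '0|1', '1|0'],
}).set_index(['CHROM', 'POS', 'REF', 'ALT'])

# Same call pattern as the cell above: sample_ids becomes index level 4,
# then that level is pivoted into columns.
wide = long_df.set_index('sample_ids', append=True).unstack(level=4)

# Fraction of variant-by-sample cells with no call behind them.
sparsity = wide.isnull().values.mean()
print(wide.shape, sparsity)  # (2, 3) 0.5
```

Two variants by three samples give six cells but only three calls exist, so half of the wide table is NaN. With rare variants and thousands of samples the empty fraction is far higher.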

Convenience Function for Parsing an Entire Multisample File

!!! Known Issue: get_whole_file will break if there are duplicate rows for the same genotype.



In [11]:

    
def get_whole_file(vcf_path, sample_ids='all', columns=['#CHROM', 'POS', 'REF', 'ALT', 'FORMAT'], \
                   add_variant_annotations=True, split_columns='', chunksize=5000, inplace=True, n_cores=1):
    '''
    Parse the whole multi-sample VCF file chunk by chunk
    and return a single dataframe.
    
    Note: using multiple cores with add_variant_annotations is
    very memory intensive, as the parsed dataframe is copied to each process.
    '''
    
    vcf_df_obj = Vcf(vcf_path, sample_id=sample_ids, cols=columns, chunksize=chunksize, n_cores=n_cores)  #initiate object
    data = []  #list of per-chunk dataframes
    
    while True:

        vcf_df_obj.get_vcf_df_chunk()  #retrieving df chunk
        if vcf_df_obj.stopIteration: break  #checking for end of file
        
        if add_variant_annotations:  
            vcf_df_obj.add_variant_annotations(split_columns=split_columns, inplace=inplace)  #parsing df and adding annotations
            if inplace:
                data.append(vcf_df_obj.df)
            else:
                data.append(vcf_df_obj.df_annot)  #aggregating annotation data
        else:
            data.append(vcf_df_obj.df)  #aggregating the unannotated chunk

    df = pd.concat(data)
    return df
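
Given the known issue noted above, one possible precaution is to check a table of variant records for repeated keys before parsing. The sketch below uses plain pandas on made-up records. `drop_duplicate_variants` is a hypothetical helper, not part of pandasVCF, and it assumes that a duplicate means a repeated (#CHROM, POS, REF, ALT) combination; whether keeping the first record is appropriate depends on the data.

```python
import pandas as pd

VARIANT_KEY = ['#CHROM', 'POS', 'REF', 'ALT']

def drop_duplicate_variants(df, key=VARIANT_KEY):
    """Return (deduplicated frame, number of dropped rows), keeping first hits."""
    dup_mask = df.duplicated(subset=key, keep='first')
    return df[~dup_mask], int(dup_mask.sum())

# Made-up records; the second and third rows share the same variant key.
raw = pd.DataFrame({
    '#CHROM': ['22', '22', '22'],
    'POS': [100, 200, 200],
    'REF': ['A', 'G', 'G'],
    'ALT': ['G', 'A', 'A'],
    'S1': ['0|1', '0|0', '0|1'],
})

clean, n_dropped = drop_duplicate_variants(raw)
print(len(clean), n_dropped)  # 2 1
```

Reporting the number of dropped rows makes it easy to decide whether the duplicates deserve a closer look before they are discarded.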



In [12]:

    
%time master_df = get_whole_file(vcf_path, sample_ids='all', \
                                 columns=['#CHROM', 'POS', 'REF', 'ALT','FORMAT', 'INFO'], \
                                 chunksize=5000, n_cores=20)

End of File Reached
CPU times: user 1min 31s, sys: 2.53 s, total: 1min 34s
Wall time: 1min 40s



In [13]:

    
master_df.zygosity.value_counts().plot(kind='bar', log=True, grid=True, color='seagreen')

    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x10e904e50>



In [14]:

    
master_df.vartype2.value_counts().plot(kind='bar', log=True, grid=True, color='seagreen')

    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0x113d20210>



In [15]:

    
master_df.vartype2.value_counts()

    Out[15]:





snp    483986
ref    244660
del     19604
ins      7551
dtype: int64
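
The vartype labels above (snp, ref, del, ins) are consistent with comparing each called allele against REF by length. A minimal sketch under that assumption follows. `classify_allele` is hypothetical, the `mnp` label for equal-length multi-base substitutions is an illustrative addition that does not appear in the output above, and the actual pandasVCF rules may differ.

```python
def classify_allele(ref, allele):
    """Label a called allele relative to the reference allele."""
    if allele == ref:
        return 'ref'
    if len(allele) == len(ref):
        # Same length but different bases: single- or multi-base substitution.
        return 'snp' if len(ref) == 1 else 'mnp'
    # Longer than REF means inserted bases, shorter means deleted bases.
    return 'ins' if len(allele) > len(ref) else 'del'

labels = [classify_allele(r, a) for r, a in
          [('G', 'G'), ('G', 'A'), ('AT', 'A'), ('A', 'AT')]]
print(labels)  # ['ref', 'snp', 'del', 'ins']
```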



In [16]:

    
len(master_df)

    Out[16]:





755801



In [17]:

    
master_df.head(20)

    Out[17]:

                           sample_ids  multiallele  phase  GT1  GT2  a1  a2  zygosity  vartype1  vartype2  GT   FORMAT  hom_ref_counts  INFO
CHROM  POS       REF  ALT
22     16050075  A    G    HG03770     0            |      0    1    A   G   het-ref   ref       snp       0|1  GT      2503            AC=1;AF=0.000199681;AN=5008;NS=2504
       16050115  G    A    HG01363     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG02334     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG02343     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG02574     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG03052     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG03354     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG03432     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG03473     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA18516     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA18858     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA18874     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA19027     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA19121     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA19137     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA19707     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    NA19984     0            |      0    1    G   A   het-ref   ref       snp       0|1  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG02497     0            |      1    0    A   G   het-ref   snp       ref       1|0  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG02536     0            |      1    0    A   G   het-ref   snp       ref       1|0  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504
                      A    HG02623     0            |      1    0    A   G   het-ref   snp       ref       1|0  GT      2472            AC=32;AF=0.00638978;AN=5008;NS=2504



In [18]:

    
master_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 755801 entries, (22, 16050075, A, G) to (22, 16644712, G, C)
Data columns (total 14 columns):
sample_ids        755801 non-null object
multiallele       755801 non-null int64
phase             755801 non-null object
GT1               755801 non-null int64
GT2               755801 non-null int64
a1                755801 non-null object
a2                755801 non-null object
zygosity          755801 non-null object
vartype1          755801 non-null object
vartype2          755801 non-null object
GT                755801 non-null object
FORMAT            755801 non-null object
hom_ref_counts    755801 non-null float64
INFO              755801 non-null object
dtypes: float64(1), int64(3), object(10)
memory usage: 84.4+ MB
