In [31]:
import pandas as pd
import numpy as np
import os
data_path = '/home/jovyan/work/LEAP/leap/regression/dataset1'
os.chdir(data_path)
In [32]:
os.listdir('./')
Out[32]:
In [3]:
import pip
In [4]:
pip.main(['install', 'pysnptools'])
Out[4]:
In [33]:
from pysnptools.snpreader.bed import Bed
import pysnptools.util as pstutil
import pysnptools.util.pheno as phenoUtils
In [34]:
bfile = 'dataset1'
phenoFile = bfile+'.phe'
chromosomes = xrange(1,11)
prevalence = 0.001
bed = Bed(bfile).read().standardize()
causalSNPs = [s for s in bed.sid if 'csnp' in s]
In [36]:
bed.sid[:5]
Out[36]:
In [8]:
bed.iid
Out[8]:
In [39]:
print "bed matrix shape:", bed.val.shape
print "Size of bed matrix: %4.0fmb" %(bed.val.nbytes/(1024**2))
bed.val[0,:100]
Out[39]:
In [26]:
f = Bed(bfile)
f.sid_count
Out[26]:
In [29]:
np.unique(f.read().val)
Out[29]:
In both Python 2.7 and Python 3.5 the modules that read the .bed files perpetually keep the .bed file open. The one in Python3.5 can close the connection though, pysnptools can't(at least couldn't find a way).
The object 'bed' from Bed(file).read().standardize() has a method bid.sid that enumerates all of the SNPS contained in the file. The equivalent in the plinkio is plinkio.open(file).get_locus() and then each item in locus contains a SNP and the information thereon, including which chromosome it comes from.
In [10]:
import pysnptools.util as pstutil
import pysnptools.util.pheno as phenoUtils
In [9]:
pheno = phenoUtils.loadOnePhen('dataset1.phe', missing='-9', vectorize=True)
In [19]:
pheno['iid']
Out[19]:
In [24]:
print("Size of pheno matrix: ",pheno['vals'].shape)
np.unique(pheno['vals'])
Out[24]:
In [24]:
bed, pheno = pstutil.intersect_apply([f, pheno])
bed.sid_count
In [ ]:
Apparently, the pheno['iid'] is suppose to match the bed.iid object. When the writers do "checkIntersection" they essentially check whether the persons and family of bed and pheno objects match up. If not, it throws an error. I presume this is how the two are matched up. Either way, "checkIntersection" doesn't return anything. I think pstutil.intersect_apply is the function that intersects the two files? Documentation isn't clear either.
It seems that plinkio reads in phenotype automatically and the phenotype can be read from the generator items in sample_list. However, it seems that the phenotypes returned by plinkio are approximately normal...which is strange and may be due to our test data file.
Information such as family, iid, and the like can also be accessed in a similar fashion as phenotypes
In [11]:
covarsDict = phenoUtils.loadPhen('dataset1.cov')
In [12]:
covarsDict.keys()
Out[12]:
In [13]:
covarsDict['iid']
Out[13]:
In [15]:
covarsDict['vals'][:10]
Out[15]:
In [ ]:
#related file is apparently generated and then is optionally in. Otherwise it's created from bed
#relatedDict = phenoUtils.loadOnePhen(relFile, vectorize=True)