In [31]:
import pandas as pd
import numpy as np
import os
data_path = '/home/jovyan/work/LEAP/leap/regression/dataset1'
os.chdir(data_path)

In [32]:
os.listdir('./')


Out[32]:
['extracts',
 'dataset1-Copy1.bed',
 'dataset1.bed',
 'dataset1.cov',
 '.pversion',
 'dataset1.bim',
 'dataset1.fam',
 'dataset1.phe.liab',
 'dataset1.phe']

In [3]:
import pip

In [4]:
pip.main(['install', 'pysnptools'])


Collecting pysnptools
  Downloading pysnptools-0.3.9.zip (225kB)
Requirement already satisfied (use --upgrade to upgrade): scipy>=0.15.1 in /opt/conda/envs/python2/lib/python2.7/site-packages (from pysnptools)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.9.2 in /opt/conda/envs/python2/lib/python2.7/site-packages (from pysnptools)
Requirement already satisfied (use --upgrade to upgrade): pandas>=0.16.2 in /opt/conda/envs/python2/lib/python2.7/site-packages (from pysnptools)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /opt/conda/envs/python2/lib/python2.7/site-packages (from pandas>=0.16.2->pysnptools)
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /opt/conda/envs/python2/lib/python2.7/site-packages (from pandas>=0.16.2->pysnptools)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /opt/conda/envs/python2/lib/python2.7/site-packages (from python-dateutil->pandas>=0.16.2->pysnptools)
Building wheels for collected packages: pysnptools
  Running setup.py bdist_wheel for pysnptools: started
  Running setup.py bdist_wheel for pysnptools: finished with status 'done'
  Stored in directory: /home/jovyan/.cache/pip/wheels/67/3f/0a/2d16f1dca863803132f2fd208c928d6c388d4d0161bd8a94f9
Successfully built pysnptools
Installing collected packages: pysnptools
Successfully installed pysnptools-0.3.9
You are using pip version 8.0.3, however version 8.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Out[4]:
0

In [33]:
from pysnptools.snpreader.bed import Bed
import pysnptools.util as pstutil
import pysnptools.util.pheno as phenoUtils

Getting a grip on bed files


In [34]:
bfile = 'dataset1'
phenoFile = bfile+'.phe'
chromosomes = xrange(1,11)
prevalence = 0.001
bed = Bed(bfile).read().standardize()
causalSNPs = [s for s in bed.sid if 'csnp' in s]


Loading fam file dataset1.fam
Loading bim file dataset1.bim
bed file is open dataset1.bed

In [36]:
bed.sid[:5]


Out[36]:
array(['csnp18', 'csnp35', 'csnp59', 'csnp78', 'csnp85'], 
      dtype='|S8')

In [8]:
bed.iid


Out[8]:
array([['FAM1', 'person1'],
       ['FAM1', 'person2'],
       ['FAM1', 'person3'],
       ..., 
       ['FAM1', 'person998'],
       ['FAM1', 'person999'],
       ['FAM1', 'person1000']], 
      dtype='|S10')

In [39]:
print "bed matrix shape:", bed.val.shape
print "Size of bed matrix: %4.0fmb" %(bed.val.nbytes/(1024**2))
bed.val[0,:100]


bed matrix shape: (1000, 10499)
Size of bed matrix:   80mb
Out[39]:
array([ 1.02560486, -0.27215848, -0.95610942,  1.02534931, -1.51981533,
       -1.30964838,  0.63613624, -0.63517447,  0.57405621, -0.260105  ,
        0.83680296, -0.29262899,  0.81886119,  0.8546494 ,  0.98413493,
        0.68009803, -0.25883879,  0.41862282, -1.75595071, -0.35218036,
       -0.62280452,  0.37323019, -1.85210593,  0.38161617, -0.47674932,
        1.04471355,  0.409157  ,  1.25615244, -1.00629693,  0.32462797,
        0.51189572,  0.97311501, -1.07204019,  1.19379955,  0.8073641 ,
        0.96046928,  0.35903931, -0.12674527, -0.55584135, -1.60930407,
        0.60472281, -1.70265594,  0.40542603,  0.37102079,  0.47297522,
        1.29137371,  1.26291616,  1.31811137, -0.50942098,  0.3869888 ,
        0.76623177,  1.2044326 ,  1.31310913,  0.89309116, -0.98461312,
        0.80937703, -0.11675322, -1.4818986 , -0.59343479,  0.9486833 ,
        1.08114749,  0.65012832,  0.41768823,  0.71552269,  0.89087584,
       -0.96214778,  0.76100977,  1.25437309,  1.10277769,  0.71819384,
       -0.40231063, -1.4946982 ,  1.11931722, -2.00298107, -2.43748814,
       -0.69148984, -1.27778105,  0.44324371, -0.73352656, -2.21485187,
        1.29809898, -2.20839212, -0.36593953,  1.09450837,  0.82295589,
        0.63157195,  0.95644972, -1.18861002,  0.55217149, -0.39384837,
        0.52111151,  1.13288607, -0.01490193,  1.307993  , -0.65449807,
       -0.07539255, -1.91107868, -0.08229376, -0.11010442,  0.93603593])

In [26]:
f = Bed(bfile)
f.sid_count


Loading fam file dataset1.fam
Loading bim file dataset1.bim
bed file is open dataset1.bed
Out[26]:
10499

In [29]:
np.unique(f.read().val)


Out[29]:
array([ 0.,  1.,  2.])

Results:

In both Python 2.7 and Python 3.5 the modules that read the .bed files perpetually keep the .bed file open. The one in Python3.5 can close the connection though, pysnptools can't(at least couldn't find a way).

The object 'bed' from Bed(file).read().standardize() has a method bid.sid that enumerates all of the SNPS contained in the file. The equivalent in the plinkio is plinkio.open(file).get_locus() and then each item in locus contains a SNP and the information thereon, including which chromosome it comes from.

Decomposing leap Utils BED processing functions


In [10]:
import pysnptools.util as pstutil
import pysnptools.util.pheno as phenoUtils

In [9]:
pheno = phenoUtils.loadOnePhen('dataset1.phe', missing='-9', vectorize=True)


WARNING:root:loadPhen is using default missing value of '-9'.

In [19]:
pheno['iid']


Out[19]:
array([['FAM1', 'person1'],
       ['FAM1', 'person2'],
       ['FAM1', 'person3'],
       ..., 
       ['FAM1', 'person998'],
       ['FAM1', 'person999'],
       ['FAM1', 'person1000']], 
      dtype='|S10')

In [24]:
print("Size of pheno matrix: ",pheno['vals'].shape)
np.unique(pheno['vals'])


('Size of pheno matrix: ', (1000,))
Out[24]:
array([ 0.,  1.])

In [24]:
bed, pheno = pstutil.intersect_apply([f, pheno])
bed.sid_count


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-cc691a0fdac4> in <module>()
----> 1 bed, pheno = pstutil.intersect_apply([f, pheno])
      2 bed.sid_count

NameError: name 'f' is not defined

In [ ]:

Results

Apparently, the pheno['iid'] is suppose to match the bed.iid object. When the writers do "checkIntersection" they essentially check whether the persons and family of bed and pheno objects match up. If not, it throws an error. I presume this is how the two are matched up. Either way, "checkIntersection" doesn't return anything. I think pstutil.intersect_apply is the function that intersects the two files? Documentation isn't clear either.

It seems that plinkio reads in phenotype automatically and the phenotype can be read from the generator items in sample_list. However, it seems that the phenotypes returned by plinkio are approximately normal...which is strange and may be due to our test data file.

Information such as family, iid, and the like can also be accessed in a similar fashion as phenotypes


In [11]:
covarsDict = phenoUtils.loadPhen('dataset1.cov')


loadPhen is using default missing value of '-9'.

In [12]:
covarsDict.keys()


Out[12]:
['vals', 'header', 'iid']

In [13]:
covarsDict['iid']


Out[13]:
array([['FAM1', 'person1'],
       ['FAM1', 'person2'],
       ['FAM1', 'person3'],
       ..., 
       ['FAM1', 'person998'],
       ['FAM1', 'person999'],
       ['FAM1', 'person1000']], 
      dtype='|S18')

In [15]:
covarsDict['vals'][:10]


Out[15]:
array([[ 0.87259804],
       [-0.6069933 ],
       [-0.82491783],
       [-2.03853838],
       [ 1.84707776],
       [ 0.03316381],
       [-0.72371969],
       [-0.80086393],
       [ 0.91305371],
       [ 0.24719815]])

Covars

The .cov file is a tab delimited file with three columns: FAM, person, covariance. The loadPhen function just loads the stuff into a dictionary but this could be easily done into a regular pandas. (our goal)


In [ ]:
#related file is apparently generated and then is optionally in. Otherwise it's created from bed
#relatedDict = phenoUtils.loadOnePhen(relFile, vectorize=True)