Currently, our code uses the newer plinkio Python library, which lets us read the files line by line (or in chunks) and insert the data into a NumPy array and subsequently a Pandas DataFrame. Although we are currently dealing with a rather small dataset (1,000 by 10,500), for larger datasets we plan to implement the HDF5 protocol using PyTables and Pandas.
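The reading step looks roughly like the sketch below; the file prefix `dataset1` is a placeholder, and we assume the usual plinkio conventions (iteration yields one SNP row at a time, with genotypes coded 0/1/2 and 3 for missing):

```python
import numpy as np
import pandas as pd
from plinkio import plinkfile

# "dataset1" stands in for the PLINK prefix (dataset1.bed/.bim/.fam).
plink_file = plinkfile.open("dataset1")
samples = plink_file.get_samples()   # one entry per subject (.fam)
loci = plink_file.get_loci()         # one entry per SNP (.bim)

# plinkio yields one SNP (row) at a time, so we fill a SNPs x subjects
# array and transpose to the subjects x SNPs orientation used below.
geno = np.empty((len(loci), len(samples)), dtype=np.int8)
for i, row in enumerate(plink_file):
    geno[i, :] = list(row)           # genotypes coded 0/1/2 (3 = missing)
plink_file.close()

df = pd.DataFrame(geno.T,
                  index=[s.iid for s in samples],
                  columns=[l.name for l in loci])
```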
The advantage of Pandas is the ability to view and slice data without making copies. Moreover, we can embed identification, chromosome, and SNP information, which yields more efficient indexing and eliminates the need to constantly check the intersection between numerous PLINK files and bed file contents.
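For instance, a column MultiIndex keyed by chromosome and SNP name (a hypothetical extension of the sketch above) makes a per-chromosome slice a single lookup rather than an intersection of file contents:

```python
# Continuing the sketch above: embed chromosome and SNP name directly
# in the DataFrame's columns so the metadata travels with the genotypes.
df.columns = pd.MultiIndex.from_arrays(
    [[l.chromosome for l in loci], [l.name for l in loci]],
    names=["chrom", "snp"])

chr1 = df.xs(1, axis=1, level="chrom")   # all chromosome-1 SNPs at once
```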
The largest obstacle to a quick implementation of the LEAP algorithm is its present dependence on Fortran libraries and compilers through its use of fastLMM, scipy.linalg.blas, and other obscurities. Consequently, one of our goals is to run LEAP efficiently without recourse to these elements. More important, however, is our inability to directly replicate the original LEAP code, both because we cannot use some of its functions and because the code itself is hard to read. In the latter case, the fact that functions are scattered across three or four separate files without input/output descriptions made the code painstakingly difficult to comprehend. Unfortunately, the paper describing the algorithm is equally good at obscuring many of these details.
Then, of course, there is the former issue. The code, for example, uses the Fortran blas.dsyrk function to quickly compute $XX^T$, the unnormalized covariance matrix, where $X$ is an $n\,(\text{number of subjects}) \times p\,(\text{number of SNPs})$ matrix of SNP values (0, 1, 2 in dataset1) normalized to have mean zero. Since we could not use it, we computed $XX^T$ with np.dot(X, X.T) (roughly two times faster than np.cov) and then estimated the covariance matrix by dividing the result by the number of SNPs. The joy, however, doesn't stop there. It is apparently a well-known issue that the empirical covariance estimate does not serve well for the eigenvalue decomposition of the matrix; instead, we should use shrinkage covariance estimators. To this end, we compared the following off-the-shelf estimators in scikit-learn: the Oracle Approximating Shrinkage (OAS) and Ledoit-Wolf shrinkage estimators, skipping the cross-validation option for fear of computational intractability. The OAS estimator yielded covariance values closest to those in the original LEAP algorithm, while the Ledoit-Wolf estimator heavily underestimated them. Consequently, we stuck with OAS.
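In code, the comparison looked roughly like the sketch below. We assume $X$ is already the mean-centered $n \times p$ matrix; fitting the estimators on $X^T$ treats each SNP as one observation of the $n$ subject "features", so the result is an $n \times n$ matrix comparable to what $XX^T/p$ estimates:

```python
import numpy as np
from sklearn.covariance import OAS, LedoitWolf

n, p = X.shape                       # X: mean-centered n x p SNP matrix

# Unnormalized covariance via a plain matrix product (in our tests,
# roughly twice as fast as going through np.cov).
XXt = np.dot(X, X.T)
emp_cov = XXt / p                    # empirical n x n covariance estimate

# Shrunk estimates: fit on X.T so that SNPs are the observations and
# subjects are the features, yielding n x n shrunk covariance matrices.
oas_cov = OAS(assume_centered=True).fit(X.T).covariance_
lw_cov = LedoitWolf(assume_centered=True).fit(X.T).covariance_
# OAS tracked the original LEAP covariance values best, so we kept it.
```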
Perhaps the best approach would be to compute the SVD of $X$ directly instead of going through the relationship between the eigendecomposition of the covariance matrix and the SVD. However, this may not be computationally feasible for the very high-dimensional problem involved here.
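For reference, the relationship in question is that if $X = USV^T$, then $XX^T = US^2U^T$, so the eigenvectors of the empirical covariance are the left singular vectors of $X$. A small check, reusing `X`, `p`, and `emp_cov` from the sketch above (this holds for the unshrunk estimate only):

```python
# Eigenvalues of the empirical covariance are the squared singular
# values of X divided by p; eigenvectors are the left singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = s**2 / p   # matches np.linalg.eigvalsh(emp_cov) up to ordering
```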
Although I understand the mathematics and algorithms used in LEAP, a severe limitation was my lack of biological background. A better high-level understanding of the process would perhaps have facilitated the implementation of LEAP.