In [1]:
# Add all necessary imports here
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.reload_library()
plt.style.use("ggplot")

Cover Slide 1

Phasing and Imputation

Local Clustering model in genomics

Student Thomas Dias Alves

Supervisor Michael Blum

Supervisor Julien Mairal

 

 

Problem

Phasing

Humans are diploid: we carry two copies of each chromosome. At each site, the genotype is the sum of the alleles carried by these two copies (the haplotypes). Given the genotypes of a number of individuals, haplotype phasing techniques estimate the underlying haplotypes.
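To make the relation between haplotypes and genotype concrete, here is a minimal sketch with NumPy. The two haplotypes are made-up values for one hypothetical individual, not data from this project.

```python
import numpy as np

# Two hypothetical haplotypes for one individual over 5 biallelic sites
h1 = np.array([0, 1, 1, 0, 1])
h2 = np.array([0, 0, 1, 1, 1])

# The observed genotype is the site-wise sum of the two copies, in {0, 1, 2}
g = h1 + h2

# Homozygous sites (g == 0 or g == 2) fix both alleles;
# heterozygous sites (g == 1) are the ones phasing has to resolve.
het_sites = np.flatnonzero(g == 1)

# Number of distinct unordered haplotype pairs compatible with g
n_compatible_pairs = 2 ** max(len(het_sites) - 1, 0)

print(g, het_sites, n_compatible_pairs)
```

With two heterozygous sites, two unordered haplotype pairs explain the same genotype, which is why phasing needs information shared across individuals.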

Imputation

Experimental limitations can leave many genotypes missing, but many methods only work on a complete dataset. Imputation fills in the missing values.
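As a small illustration of why missing data matters, the sketch below counts missing calls in a made-up genotype matrix. The encoding of a missing call as -1 is an assumption chosen for this example.

```python
import numpy as np

# Hypothetical genotype matrix (individuals x sites); -1 marks a missing call
G = np.array([
    [0, 1, -1, 2],
    [1, -1, 0, 2],
    [2, 1, 0, -1],
    [0, 1, 1, 2],
])

missing = G == -1
missing_rate = missing.mean()            # fraction of missing entries
complete_rows = ~missing.any(axis=1)     # individuals with no missing call
n_complete = int(complete_rows.sum())

print(missing_rate, n_complete)
```

Here fewer than 20% of the entries are missing, yet only one individual out of four could be used by a method that requires complete data.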

State of the Art

fastPHASE

Bayesian model and EM algorithm

An optimization formulation

$$ \begin{aligned} & \underset{A,S}{\text{minimize}} & & \sum_{i = 0}^{n-1} \sum_{j = 0}^{m-1} \| h_{i,j}^{1} - a_{s_{i,j}^{1},j}\|^2 + \| h_{i,j}^{2} - a_{s_{i,j}^{2},j}\|^2\\ &&& + \lambda \mathbb{1}_{s_{i,j}^{1} \neq s_{i,j-1}^{1}} + \lambda \mathbb{1}_{s_{i,j}^{2} \neq s_{i,j-1}^{2}} \\ & \text{subject to} & & a_{i,j} \in \left[0,1\right], h_{i,j} \in \lbrace 0, 1\rbrace, s_{i,j} \in \lbrace 0, 1, \ldots, k-1\rbrace\\ & & & g_{i,j} = h_{i,j}^{1} + h_{i,j}^{2} \end{aligned} $$
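The objective above can be evaluated directly for given haplotypes, cluster assignments and centroids. The following is a sketch only: the function name `objective` and the array layout (`A` of shape `(k, m)`, the others of shape `(n, m)`) are assumptions, and the switch penalty is counted from site `j = 1` onward, since there is no previous site at `j = 0`.

```python
import numpy as np

def objective(H1, H2, S1, S2, A, lam):
    """Value of the clustering objective.

    H1, H2: (n, m) haplotypes in {0, 1}
    S1, S2: (n, m) integer cluster labels in {0, ..., k-1}
    A:      (k, m) cluster centroids in [0, 1]
    lam:    penalty paid each time a haplotype switches cluster
    """
    cols = np.arange(A.shape[1])
    # Squared distance between each allele and the centroid of its cluster
    fit = ((H1 - A[S1, cols]) ** 2).sum() + ((H2 - A[S2, cols]) ** 2).sum()
    # Number of cluster switches between consecutive sites
    switches = (S1[:, 1:] != S1[:, :-1]).sum() + (S2[:, 1:] != S2[:, :-1]).sum()
    return float(fit + lam * switches)

# Toy instance: one individual, three sites, two clusters
A = np.array([[0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
H1 = np.array([[0, 0, 0]]); S1 = np.array([[0, 0, 1]])
H2 = np.array([[1, 1, 1]]); S2 = np.array([[1, 1, 1]])
value = objective(H1, H2, S1, S2, A, lam=0.5)
print(value)
```

In this toy instance the first haplotype fits perfectly but pays one switch, while the second never switches and pays one unit of fit error at the last site.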

About the algorithm

  • An optimization problem
  • With constraints
  • Graph algorithm: forward-backward
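One way to picture the graph step is a min-sum dynamic program over a sites-by-clusters trellis: a forward pass accumulates the cheapest cost of reaching each cluster, and a backward pass recovers the path. The sketch below is a Viterbi-style illustration under simplifying assumptions (a single known haplotype `h`, fixed centroids `A`, no genotype constraint). It is not necessarily the exact procedure used in this work, and the name `best_path` is hypothetical.

```python
import numpy as np

def best_path(h, A, lam):
    """Cluster path s minimizing sum_j (h[j] - A[s[j], j])^2 + lam * [s[j] != s[j-1]]."""
    k, m = A.shape
    cost = (h[None, :] - A) ** 2            # cost[c, j]: fit cost of cluster c at site j
    total = cost[:, 0].copy()               # cheapest cost of ending in each cluster
    back = np.zeros((k, m), dtype=int)      # best predecessor of each cluster at each site

    # Forward pass: stay in the same cluster, or jump from the cheapest one and pay lam
    for j in range(1, m):
        best_prev = int(np.argmin(total))
        jump = total[best_prev] + lam
        back[:, j] = np.where(total <= jump, np.arange(k), best_prev)
        total = np.minimum(total, jump) + cost[:, j]

    # Backward pass: follow the stored predecessors from the cheapest end state
    s = np.empty(m, dtype=int)
    s[-1] = int(np.argmin(total))
    for j in range(m - 1, 0, -1):
        s[j - 1] = back[s[j], j]
    return s, float(total.min())

A = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
h = np.array([0, 0, 1, 1])
path_cheap, cost_cheap = best_path(h, A, lam=0.5)   # switching is cheap
path_dear, cost_dear = best_path(h, A, lam=5.0)     # switching is expensive
print(path_cheap, cost_cheap, path_dear, cost_dear)
```

Because every jump costs the same `lam`, each site only needs the single cheapest previous cluster, so the forward pass runs in O(km) rather than O(k²m). A small `lam` lets the path switch cluster where the haplotype changes, while a large `lam` keeps it in one cluster at the price of a worse fit.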

Software

Results: Speed

Results: Imputation

Q&A Slide

Questions?