In this tutorial we are going to use a GWAS dataset (accessible from this link) together with the whole ENCODE BroadPeak dataset to find which mutations (and their associated traits) are most represented in enhancer regions which are present in a limited set of cell lines.
As first thing let's download the data and reformat into a bed-like file.
In [1]:
%%bash
wget -q https://www.ebi.ac.uk/gwas/api/search/downloads/full -O tmp.tsv
cat tmp.tsv | awk 'BEGIN {FS="\t";OFS="\t"} {chrom=$12; gsub(chrom,"chr"chrom,$12)}{print $0}' | sed s/,//g > gwas.tsv
rm tmp.tsv
In [2]:
import gmql as gl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
In [3]:
! head -1 gwas.tsv | tr "\t" "\n"
We are mainly interested in the mutation position (11-th and 12-th column) and the associated trait (7-th).
In [4]:
gwas = gl.load_from_file("./gwas.tsv",
parser=gl.parsers.RegionParser(chrPos=11,
startPos=12,
stopPos=12,
otherPos=[(7, "trait", 'string')]))
In [5]:
gwas.head().regs
Out[5]:
We can also simply look at the schema
In [6]:
gwas.schema
Out[6]:
In [7]:
gwas_data = gwas.materialize().regs
We now plot the number of regions for each of the top 30 represented traits.
In [8]:
plt.figure(figsize=(20,5))
sns.countplot(data=gwas_data[gwas_data.trait.isin(gwas_data.trait.value_counts().iloc[:30].index)], x='trait')
plt.xticks(rotation=90)
plt.title("Top represented GWAS traits", fontsize=20)
plt.show()
If the data come already in the GDM format, they can be loaded using the load_from_path
function. A GDM dataset is stored as a folder having the following structure:
/path/to/dataset/:
- sample1.gdm
- sample1.gdm.meta
- sample2.gdm
- sample2.gdm.meta
- ...
- schema.xml
In this case, the parser for the data is automatically inferred by the library from the schema.xml
file.
In [9]:
broad = gl.load_from_path("../data/HG19_ENCODE_BROAD/")
In [10]:
broad.schema
Out[10]:
In [11]:
acetyl = broad[broad['experiment_target'] == 'H3K27ac-human']
We get the peak region of the Chip-Seq using the reg_project
function. The peak position (peak
) is given by the center of the region.
In [12]:
peaked = acetyl.reg_project(new_field_dict={'peak': (acetyl.right + acetyl.left)/2})
Once we have the peak, we extend the search region to $\pm 1500 bp$. We use again reg_project
In [13]:
enlarge = peaked.reg_project(new_field_dict={'left': peaked.peak - 1500, 'right': peaked.peak + 1500})
We are interested in enhancers which are cell line specific. Therefore it is important to group our data by cell line. In addition to this we merge the signals coming from different tracks for the same cell line. We can do both of these actions using the normal_cover
function.
As output of the following command, we have a dataset that contains a single track for each cell line. The track is computed merging the replicas of the experiment targeting H3K9ac for the same cell line.
In [14]:
enhancers_by_cell_line = enlarge.normal_cover(1, "ANY", groupBy=['biosample_term_name'])
To select only the cell line specific enhancers we can now apply again normal_cover
and constraining the maximum number of overlaps between the regions to be a selected threshold.
In this case we select a threshold of 2.
In this case the output contains a single sample with all the enhancers which are present in at most max_overlapping
cell lines.
In [15]:
max_overlapping = 2
cell_specific_enhancers = enhancers_by_cell_line.normal_cover(1, max_overlapping)
In [16]:
cell_specific_enhancers.schema
Out[16]:
Finally, we need to re-associate every cell specific enhancers in cell_specific_enhancers
to all the max_overlapping
cell lines in which it is present.
Therefore, we used a join
to select, for each cell line, only those enchancers that overlap a region in the cell_specific_enhancers
.
In [17]:
cell_specific_enhancers_by_cell_line = enhancers_by_cell_line.join(cell_specific_enhancers, [gl.DLE(0)], 'left', refName="en", expName="csen")
In [18]:
gwas.schema
Out[18]:
In [19]:
enhancer_gwas = cell_specific_enhancers_by_cell_line.map(gwas, refName="csen", expName="gwas", new_reg_fields={'traits': gl.BAG('trait')})
enhancer_gwas = enhancer_gwas.reg_project(["count_csen_gwas", "traits"],new_field_dict={'cell_line': enhancer_gwas['csen.en.biosample_term_name', 'string']})
In [20]:
enhancer_gwas = enhancer_gwas.materialize()
The traits
column of the resulting region is the list of traits associated with the cell specific enhancer. The data comes in the form of a string of trait names.
We convert the string to a list.
In [21]:
enhancer_gwas.regs['traits'] = enhancer_gwas.regs.traits.map(lambda x: x.split(",") if pd.notnull(x) else x)
The final part of the analysis regards the matching of cell lines and traits. We want to understand if a cell line (which is represented by its specific enhancers) has some particular mutation trait associated.
The analysis is performed in Pandas using the result region attributes traits
and cell_line
.
We build an association matrix between cell lines and traits by firstly converting the result to a list of (cell_line, trait)
, converting it to a Pandas DataFrame, and finally using the crosstab
Pandas function to extract the matrix.
In [22]:
cell_trait = pd.DataFrame.from_records([(k, v) for k, vs in enhancer_gwas.regs[enhancer_gwas.regs.count_csen_gwas > 0].groupby("cell_line").traits.sum().to_dict().items() for v in vs],
columns=['cell_line', 'trait'])
In [23]:
cross = pd.crosstab(cell_trait.cell_line, cell_trait.trait)
We finally plot the result as an heatmap.
In [24]:
plt.figure(figsize=(50, 15))
sns.heatmap(cross[cross.sum(0).sort_values(ascending=False).iloc[:100].index], cmap='Reds', vmax=70, linewidths=1, annot=True, cbar=False)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.xlabel("Trait", fontsize=30)
plt.ylabel("Cell line", fontsize=30)
plt.show()