In [4]:
import os
import sys
from typing import *
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
WARNING: these instruction will create a genome index that is only suitable for use with 10x Chromium data. We create a composite index, with separate sequence fragments representing unspliced and spliced transcripts. Unspliced fragments will be generated from genomic sequence upstream of every poly-A stretch located inside of a gene locus. Spliced fragments will be generated from transcriptomic sequence upstream of every polyadenylation site in every known spliced transcript. This reflects the locations of reads generated by Chromium, but may not be suitable for other methods.
The instructions below show how to build a genome index for the human genome. For other species, the code below will have to be modified to accomodate different sources and formats of metadata for those species.
Install the BioPython package in your python distribution:
pip install biopython # Method recommended by the BioPython people
conda install -c conda-forge biopython # Alternative, for Anaconda
Install gawk (on Linux, this is typically already installed), bedtools, and kallisto (instructions below are for macOS using Brew):
brew install gawk
brew install bedtools
brew install kallisto
Get the "Genome sequence, primary assembly (GRCh38)” file named GRCh38.primary_assembly.genome.fa.gz
Download the comprehensive PRI gene annotation on the primary assembly GTF file named gencode.v31.primary_assembly.annotation.gtf
Get the “Complete HGNC dataset” as TXT, a file named hgnc_complete_set.txt
Chromium cell barcodes are generated from a set of known sequences, which differ by version of the Chromium kits. Get three files from the cellranger GitHub repository and rename them like this:
3M-february-2018.txt.gz -> (unzip) -> 10xv3_whitelist.txt
737K-august-2016.txt -> 10xv2_whitelist.txt
737K-april-2014_rc.txt -> 10xv1_whitelist.txt
Get the human_tfs_consensus.tab file from our Google Cloud. These metadata will be used to determine which genes are transcription factors, and to assign them to families.
Create a BED file with all the genes:
cat gencode.v31.primary_assembly.annotation.gtf | gawk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,0,$7}}' | tr -d '";' > gencode.v31.primary_assembly.annotation.bed
Create a fasta file of pre-mRNA (exons + introns) transcripts:
bedtools sort -i gencode.v31.primary_assembly.annotation.bed > gencode.v31.primary_assembly.annotation.sorted.bed
bedtools merge -i gencode.v31.primary_assembly.annotation.sorted.bed -s -c 4 -o collapse > gencode.v31.primary_assembly.annotation.merged.bed
bedtools getfasta -name -fo gencode.v31.unspliced.fa -fi GRCh38.primary_assembly.genome.fa -bed gencode.v31.primary_assembly.annotation.sorted.bed
A manifest file is necessary to tell cytograph how to find the relevant files in the index. Create a file manifest.json
with the following content:
{
"species": "Homo sapiens",
"index_file": "gencode.v31.fragments.idx",
"gene_metadata_file": "gencode.v31.metadata.tab",
"gene_metadata_key": "AccessionVersion",
"fragments_to_genes_file": "fragments2genes.txt",
"layers": {
"unspliced": "unspliced_fragments.txt",
"spliced": "spliced_fragments.txt"
}
}
The gene_metadata_key
indicated which column in the metadata file (gencode.v31.metadata.tab
) contains the gene IDs (as used in fragments2genes.txt
).
Finally, organize your files in a directory and subdirectory as follows:
10xv1_whitelist.txt
10xv2_whitelist.txt
10xv3_whitelist.txt
inputs/
GRCh38.primary_assembly.genome.fa
GRCh38.primary_assembly.genome.fa.fai
gencode.v31.unspliced.fa
gencode.v31.primary_assembly.annotation.bed
gencode.v31.primary_assembly.annotation.gtf
gencode.v31.primary_assembly.annotation.sorted.bed
gencode.v31.transcripts.fa
hgnc_complete_set.txt
human_tfs_consensus.txt
The inputs
directory will only be used to generate the index, and can be discarded afterwards.
Before starting, set d
to the full path of your index directory, and extent
to the desired window size (windows are the sequences upstream of a polyadenylation site or a genomic polyA/T sequence that are used to build the index).
In [ ]:
d = "/Users/stelin/kallisto_GRCh38/human_GRCh38_gencode.v31/"
extent = 600 # how many bases away from polya to include
min_len = 90 # how many non-repeat bases required to make a transcript
Now run each of the following code blocks:
In [2]:
# Make a gene metadata table from hgnc and gencode gtf
# First, load and index the HGNC annotations from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
hgnc = {}
with open(d + "inputs/hgnc_complete_set.txt") as f:
hgnc_headers = f.readline()[:-1].split("\t")
for line in f:
items = line[:-1].split("\t")
hgnc[items[0]] = items
# Load human consensus TFs
tfs = {}
with open(d + "inputs/human_tfs_consensus.tab") as f:
tfs_headers = f.readline()[:-1].split("\t")
for line in f:
items = line[:-1].split("\t")
tfs[items[0]] = items
# Next load the gencode GTF and create a genome annotation tsv
with open(d + "gencode.v31.metadata.tab", "w") as fout:
fout.write("\t".join([
"Accession",
"AccessionVersion",
"Gene",
"FullName",
"GeneType",
"HgncID",
"Chromosome",
"Strand",
"ChromosomeStart",
"ChromosomeEnd",
"LocusGroup",
"LocusType",
"Location",
"LocationSortable",
"Aliases",
"VegaID",
"UcscID",
"RefseqID",
"CcdsID",
"UniprotID",
"PubmedID",
"MgdID",
"RgdID",
"CosmicID",
"OmimID",
"MirBaseID",
"IsTF",
"DnaBindingDomain"
]))
fout.write("\n")
with open(d + "inputs/gencode.v31.primary_assembly.annotation.gtf") as f:
for line in f:
if line.startswith("##"):
continue
items = line[:-1].split("\t")
if items[2] != "gene":
continue
extra = {x.strip().split(" ")[0]: x.strip().split(" ")[1].strip('"') for x in items[8].split(";")[:-1]}
if "hgnc_id" in extra and extra["hgnc_id"] in hgnc:
acc = extra["hgnc_id"]
ensemblID = extra["gene_id"].split(".")[0]
fout.write("\t".join([
ensemblID,
extra["gene_id"],
hgnc[acc][1], # gene symbol
hgnc[acc][2], # full name
extra["gene_type"], # gene type from gencode
acc, # hgnc ID
items[0], # Chromosome
items[3], # Start
items[4], # End
hgnc[acc][3], # Locus group
hgnc[acc][4], # Locus type
hgnc[acc][6], # Location
hgnc[acc][7], # Location, sortable
hgnc[acc][8], # Aliases
hgnc[acc][20], # VEGA id
hgnc[acc][21], # UCSC id
hgnc[acc][23], # Refseq id
hgnc[acc][24], # CCDS id
hgnc[acc][25], # Uniprot id
hgnc[acc][26], # Pubmed id
hgnc[acc][27], # MGD id
hgnc[acc][28], # RGD id
hgnc[acc][30], # COSMIC id
hgnc[acc][31], # OMIM id
hgnc[acc][32], # MIRbase id
"True" if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "False", # IsTF?
tfs[ensemblID][2] if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "" # DBD
]))
else:
ensemblID = extra["gene_id"].split(".")[0]
fout.write("\t".join([
ensemblID,
extra["gene_id"],
extra.get("gene_name", ""), # gene symbol
extra.get("gene_name", ""), # full name
extra["gene_type"], # gene type from gencode
"", # HGNC id
items[0], # Chromosome
items[6], # Strand
items[3], # Start
items[4], # End
"", # Locus group
"", # Locus type
"", # Location
"", # Location, sortable
"", # Aliases
"", # VEGA id
"", # UCSC id
"", # Refseq id
"", # CCDS id
"", # Uniprot id
"", # Pubmed id
"", # MGD id
"", # RGD id
"", # COSMIC id
"", # OMIM id
"", # MIRbase id
"True" if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "False", # IsTF?
tfs[ensemblID][2] if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "" # DBD
]))
fout.write("\n")
In [5]:
def find_polys(seq: SeqRecord, c: str = "A", n: int = 15) -> List[Tuple[int, int]]:
found = []
count = seq[:n].count(c) # Count occurences in the first k-mer
if count >= n - 1: # We have a match
found.append(0)
ix = 0
while ix < len(seq) - n - 1:
if seq[ix] == c: # Outgoing base
count -= 1
if seq[ix + n] == c: # Incoming base
count += 1
ix += 1
if count >= n - 1: # We have a match
found.append(ix)
sorted_by_lower_bound = [(f, f + n) for f in found]
# merge intervals (https://codereview.stackexchange.com/questions/69242/merging-overlapping-intervals)
merged = []
for higher in sorted_by_lower_bound:
if not merged:
merged.append(higher)
else:
lower = merged[-1]
# test for intersection between lower and higher:
# we know via sorting that lower[0] <= higher[0]
if higher[0] <= lower[1]:
upper_bound = max(lower[1], higher[1])
merged[-1] = (lower[0], upper_bound) # replace by merged interval
else:
merged.append(higher)
return merged
In [8]:
polyAs = {}
polyTs = {}
for fasta in SeqIO.parse(open(d + "inputs/gencode.v31.unspliced.fa"),'fasta'):
gene_id = fasta.id
intervals = find_polys(fasta.seq, c="A", n=14)
if len(intervals) > 0:
polyAs[gene_id] = intervals
# Collect fragments on the opposite strand, downstream of poly-Ts (not sure if such reads really happen?)
intervals = find_polys(fasta.seq, c="T", n=14)
if len(intervals) > 0:
polyTs[gene_id] = intervals
In [9]:
tr2g = {}
with open(d + "inputs/gencode.v31.primary_assembly.annotation.gtf") as f:
for line in f:
if "\ttranscript\t" in line:
items = line.split("; ")
chrom, _, _, start, end, _, strand, _, gid = items[0].split("\t")
gene_id = gid.split('"')[1]
transcript_id = items[1].split('"')[1]
gene_type = items[2].split('"')[1]
gene_name = items[3].split('"')[1]
tr2g[transcript_id] = (chrom, start, end, strand, gene_id, gene_type, gene_name)
In [10]:
count = 0
with open(d + "fragments2genes.txt", "w") as ftr2g:
with open(d + "inputs/gencode.v31.fragments.fa", "w") as fout:
# Write the nascent fragments, with one partial transcript per internal poly-A/T site
with open(d + "unspliced_fragments.txt", "w") as fucapture:
for fasta in SeqIO.parse(open(d + "inputs/gencode.v31.unspliced.fa"),'fasta'): # Note we're in the masked file now
gene_id = fasta.id
if gene_id in polyAs:
for interval in polyAs[gene_id]:
seq = str(fasta.seq[max(0, interval[0] - extent):interval[0]])
#seq = seq.translate(tr).strip("N")
if len(seq) >= min_len:
count += 1
transcript_id = f"{gene_id}.A{interval[0]}"
trseq = SeqRecord(Seq(seq), transcript_id, '', '')
fout.write(trseq.format("fasta"))
ftr2g.write(f"{transcript_id}\t{gene_id}\n")
fucapture.write(f"{transcript_id}\n")
if gene_id in polyTs:
for interval in polyTs[gene_id]:
seq = str(fasta.seq[interval[1]:interval[1] + extent].reverse_complement())
#seq = seq.translate(tr).strip("N")
if len(seq) >= min_len:
count += 1
transcript_id = f"{gene_id}.T{interval[0]}"
trseq = SeqRecord(Seq(seq), transcript_id, '', '')
fout.write(trseq.format("fasta"))
ftr2g.write(f"{transcript_id}\t{gene_id}\n")
fucapture.write(f"{transcript_id}\n")
# Write the mature fragments, covering the 3' end of each mature transcript
with open(d + "spliced_fragments.txt", "w") as fscapture:
for fasta in SeqIO.parse(open(d + "inputs/gencode.v31.transcripts.fa"),'fasta'): # Note we're in the masked file now
transcript_id = fasta.id.split("|")[0]
gene_id = fasta.id.split("|")[1]
attrs = tr2g[transcript_id]
seq = str(fasta.seq[-extent:])
if len(seq) >= min_len:
count += 1
trseq = SeqRecord(Seq(seq), f"{transcript_id}.{count} gene_id:{attrs[4]} gene_name:{attrs[6]}", '', '')
fout.write(trseq.format("fasta"))
ftr2g.write(f"{transcript_id}.{count}\t{attrs[4]}\n")
fscapture.write(f"{transcript_id}.{count}\n")
Run the following on the command line (it might take half an hour):
kallisto index -i gencode.v31.fragments.idx -k 31 inputs/gencode.v31.fragments.fa
You should now have the following directory structure:
10xv1_whitelist.txt
10xv2_whitelist.txt
10xv3_whitelist.txt
fragments2genes.txt
gencode.v31.fragments.idx
gencode.v31.metadata.tab
manifest.json
spliced_fragments.txt
unspliced_fragments.txt
inputs/
You can now remove the inputs/
subdirectory.
We created a composite kallisto index (gencode.v31.fragments.idx
), with separate sequence fragments for spliced and unspliced transcripts. The IDs of spliced and unspliced fragments are listed in the spliced_fragments.txt
and unspliced_fragments.txt
files, so that we can later count them separately. For example, here's the first few lines of unspliced_fragments.txt
(the last part of each ID, e.g. A14056
, indicates that this fragment is located upstream of a poly-A stretch at position 14056):
ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796
ENSG00000277400.1.A44325
ENSG00000277400.1.A45592
ENSG00000277400.1.A47356
ENSG00000277400.1.A49571
ENSG00000277400.1.A53084
ENSG00000277400.1.A53231
We also created a mapping from fragments to genes, fragments2genes.txt
so that we can later pool counts by gene. Here's the first few lines of that file:
ENSG00000277400.1.A14056 ENSG00000277400.1
ENSG00000277400.1.A32841 ENSG00000277400.1
ENSG00000277400.1.A35311 ENSG00000277400.1
ENSG00000277400.1.A36796 ENSG00000277400.1
ENSG00000277400.1.A44325 ENSG00000277400.1
ENSG00000277400.1.A45592 ENSG00000277400.1
ENSG00000277400.1.A47356 ENSG00000277400.1
ENSG00000277400.1.A49571 ENSG00000277400.1
ENSG00000277400.1.A53084 ENSG00000277400.1
ENSG00000277400.1.A53231 ENSG00000277400.1
Finally, we created a consolidated metadata file gencode.v31.metadata.tab
which collects a lot of useful annotation about each gene:
Accession # ENSEMBL accession
AccessionVersion # ENSEMBL accession.version
Gene # HGNC official gene symbol
FullName # Gene long name
GeneType # Like 'protein_coding', 'pseudogene', 'snRNA', 'rRNA', etc.
HgncID
Chromosome # Like 'chr1', 'chr2', 'chrM', etc.
ChromosomeStart # Integer start position of the gene
ChromosomeEnd # Integer end position of the gene
LocusGroup
LocusType
Location # Like '1p36.33'
LocationSortable # Like '01p36.33'
Aliases
VegaID
UcscID
RefseqID
CcdsID
UniprotID
PubmedID
MgdID
RgdID
CosmicID
OmimID
MirBaseID
IsTF # 'True' if the gene is a transcription factor
DnaBindingDomain # Like 'HMG/Sox', 'Homeodomain', etc.
Of these, only Accession
and Gene
are required by Cytograph.
In [ ]: