In [4]:
import os
import sys
from typing import *
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

Building a genome index for use with loompy and cytograph

WARNING: these instruction will create a genome index that is only suitable for use with 10x Chromium data. We create a composite index, with separate sequence fragments representing unspliced and spliced transcripts. Unspliced fragments will be generated from genomic sequence upstream of every poly-A stretch located inside of a gene locus. Spliced fragments will be generated from transcriptomic sequence upstream of every polyadenylation site in every known spliced transcript. This reflects the locations of reads generated by Chromium, but may not be suitable for other methods.

The instructions below show how to build a genome index for the human genome. For other species, the code below will have to be modified to accomodate different sources and formats of metadata for those species.

Install prerequisites

Install the BioPython package in your python distribution:

pip install biopython  # Method recommended by the BioPython people
conda install -c conda-forge biopython  # Alternative, for Anaconda

Install gawk (on Linux, this is typically already installed), bedtools, and kallisto (instructions below are for macOS using Brew):

brew install gawk
brew install bedtools
brew install kallisto

Download genome data

Download transcript sequences and metadata from Gencode

Get the "Genome sequence, primary assembly (GRCh38)” file named GRCh38.primary_assembly.genome.fa.gz

Download the comprehensive PRI gene annotation on the primary assembly GTF file named gencode.v31.primary_assembly.annotation.gtf

Download additional gene metadata from HGNC

Get the “Complete HGNC dataset” as TXT, a file named hgnc_complete_set.txt

Download 10x Chromium barcode whitelists

Chromium cell barcodes are generated from a set of known sequences, which differ by version of the Chromium kits. Get three files from the cellranger GitHub repository and rename them like this:

3M-february-2018.txt.gz -> (unzip) -> 10xv3_whitelist.txt
737K-august-2016.txt -> 10xv2_whitelist.txt
737K-april-2014_rc.txt -> 10xv1_whitelist.txt

Download the human transcription factor motif database

Get the human_tfs_consensus.tab file from our Google Cloud. These metadata will be used to determine which genes are transcription factors, and to assign them to families.

Preprocess the genome data

Create a BED file with all the genes:

cat gencode.v31.primary_assembly.annotation.gtf |  gawk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,0,$7}}' | tr -d '";' > gencode.v31.primary_assembly.annotation.bed

Create a fasta file of pre-mRNA (exons + introns) transcripts:

bedtools sort -i gencode.v31.primary_assembly.annotation.bed > gencode.v31.primary_assembly.annotation.sorted.bed

bedtools merge -i gencode.v31.primary_assembly.annotation.sorted.bed -s -c 4 -o collapse > gencode.v31.primary_assembly.annotation.merged.bed

bedtools getfasta -name -fo gencode.v31.unspliced.fa -fi GRCh38.primary_assembly.genome.fa -bed gencode.v31.primary_assembly.annotation.sorted.bed

Create the index manifest file, and an index directory

A manifest file is necessary to tell cytograph how to find the relevant files in the index. Create a file manifest.json with the following content:

{
    "species": "Homo sapiens",
    "index_file": "gencode.v31.fragments.idx",
    "gene_metadata_file": "gencode.v31.metadata.tab",
    "gene_metadata_key": "AccessionVersion",
    "fragments_to_genes_file": "fragments2genes.txt",
    "layers": {
        "unspliced": "unspliced_fragments.txt",
        "spliced": "spliced_fragments.txt"
    }
}

The gene_metadata_key indicated which column in the metadata file (gencode.v31.metadata.tab) contains the gene IDs (as used in fragments2genes.txt).

Finally, organize your files in a directory and subdirectory as follows:

10xv1_whitelist.txt
10xv2_whitelist.txt
10xv3_whitelist.txt
inputs/
    GRCh38.primary_assembly.genome.fa
    GRCh38.primary_assembly.genome.fa.fai
    gencode.v31.unspliced.fa
    gencode.v31.primary_assembly.annotation.bed
    gencode.v31.primary_assembly.annotation.gtf
    gencode.v31.primary_assembly.annotation.sorted.bed
    gencode.v31.transcripts.fa
    hgnc_complete_set.txt
    human_tfs_consensus.txt

The inputs directory will only be used to generate the index, and can be discarded afterwards.

Run the following scripts

Before starting, set d to the full path of your index directory, and extent to the desired window size (windows are the sequences upstream of a polyadenylation site or a genomic polyA/T sequence that are used to build the index).


In [ ]:
d = "/Users/stelin/kallisto_GRCh38/human_GRCh38_gencode.v31/"
extent = 600  # how many bases away from polya to include
min_len = 90  # how many non-repeat bases required to make a transcript

Now run each of the following code blocks:


In [2]:
# Make a gene metadata table from hgnc and gencode gtf
# First, load and index the HGNC annotations from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
hgnc = {}
with open(d + "inputs/hgnc_complete_set.txt") as f:
	hgnc_headers = f.readline()[:-1].split("\t")
	for line in f:
		items = line[:-1].split("\t")
		hgnc[items[0]] = items

# Load human consensus TFs
tfs = {}
with open(d + "inputs/human_tfs_consensus.tab") as f:
	tfs_headers = f.readline()[:-1].split("\t")
	for line in f:
		items = line[:-1].split("\t")
		tfs[items[0]] = items

# Next load the gencode GTF and create a genome annotation tsv
with open(d + "gencode.v31.metadata.tab", "w") as fout:
	fout.write("\t".join([
		"Accession",
		"AccessionVersion",
		"Gene",
		"FullName",
		"GeneType",
		"HgncID",
		"Chromosome",
		"Strand",
		"ChromosomeStart",
		"ChromosomeEnd",
		"LocusGroup",
		"LocusType",
		"Location",
		"LocationSortable",
		"Aliases",
		"VegaID",
		"UcscID",
		"RefseqID",
		"CcdsID",
		"UniprotID",
		"PubmedID",
		"MgdID",
		"RgdID",
		"CosmicID",
		"OmimID",
		"MirBaseID",
		"IsTF",
		"DnaBindingDomain"
	]))
	fout.write("\n")
	with open(d + "inputs/gencode.v31.primary_assembly.annotation.gtf") as f:
		for line in f:
			if line.startswith("##"):
				continue
			items = line[:-1].split("\t")
			if items[2] != "gene":
				continue
			extra = {x.strip().split(" ")[0]: x.strip().split(" ")[1].strip('"') for x in items[8].split(";")[:-1]}
			if "hgnc_id" in extra and extra["hgnc_id"] in hgnc:
				acc = extra["hgnc_id"]
				ensemblID = extra["gene_id"].split(".")[0]
				fout.write("\t".join([
					ensemblID,
					extra["gene_id"],
					hgnc[acc][1],  # gene symbol
					hgnc[acc][2],  # full name
					extra["gene_type"],  # gene type from gencode
					acc,  # hgnc ID
					items[0],  # Chromosome
					items[3],  # Start
					items[4],  # End
					hgnc[acc][3],  # Locus group
					hgnc[acc][4],  # Locus type
					hgnc[acc][6],  # Location
					hgnc[acc][7],  # Location, sortable
					hgnc[acc][8],  # Aliases
					hgnc[acc][20],  # VEGA id
					hgnc[acc][21],  # UCSC id
					hgnc[acc][23],  # Refseq id
					hgnc[acc][24],  # CCDS id
					hgnc[acc][25],  # Uniprot id
					hgnc[acc][26],  # Pubmed id
					hgnc[acc][27],  # MGD id
					hgnc[acc][28],  # RGD id
					hgnc[acc][30],  # COSMIC id
					hgnc[acc][31],  # OMIM id
					hgnc[acc][32],  # MIRbase id
					"True" if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "False", # IsTF?
					tfs[ensemblID][2] if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "" # DBD
				]))
			else:
				ensemblID = extra["gene_id"].split(".")[0]
				fout.write("\t".join([
					ensemblID,
					extra["gene_id"],
					extra.get("gene_name", ""),  # gene symbol
					extra.get("gene_name", ""),  # full name
					extra["gene_type"],  # gene type from gencode
					"",  # HGNC id
					items[0],  # Chromosome
					items[6],  # Strand
					items[3],  # Start
					items[4],  # End
					"",  # Locus group
					"",  # Locus type
					"",  # Location
					"",  # Location, sortable
					"",  # Aliases
					"",  # VEGA id
					"",  # UCSC id
					"",  # Refseq id
					"",  # CCDS id
					"",  # Uniprot id
					"",  # Pubmed id
					"",  # MGD id
					"",  # RGD id
					"",  # COSMIC id
					"",  # OMIM id
					"",  # MIRbase id
					"True" if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "False", # IsTF?
					tfs[ensemblID][2] if (ensemblID in tfs and tfs[ensemblID][3] == "Yes") else "" # DBD
				]))
			fout.write("\n")

In [5]:
def find_polys(seq: SeqRecord, c: str = "A", n: int = 15) -> List[Tuple[int, int]]:
	found = []
	count = seq[:n].count(c)  # Count occurences in the first k-mer
	if count >= n - 1:  # We have a match
		found.append(0)
	ix = 0
	while ix < len(seq) - n - 1:
		if seq[ix] == c:  # Outgoing base
			count -= 1
		if seq[ix + n] == c:  # Incoming base
			count += 1
		ix += 1
		if count >= n - 1:  # We have a match
			found.append(ix)
	
	sorted_by_lower_bound = [(f, f + n) for f in found]
	# merge intervals (https://codereview.stackexchange.com/questions/69242/merging-overlapping-intervals)
	merged = []
	for higher in sorted_by_lower_bound:
		if not merged:
			merged.append(higher)
		else:
			lower = merged[-1]
			# test for intersection between lower and higher:
			# we know via sorting that lower[0] <= higher[0]
			if higher[0] <= lower[1]:
				upper_bound = max(lower[1], higher[1])
				merged[-1] = (lower[0], upper_bound)  # replace by merged interval
			else:
				merged.append(higher)
	return merged

In [8]:
polyAs = {}
polyTs = {}
for fasta in SeqIO.parse(open(d + "inputs/gencode.v31.unspliced.fa"),'fasta'):
	gene_id = fasta.id
	intervals = find_polys(fasta.seq, c="A", n=14)
	if len(intervals) > 0:
		polyAs[gene_id] = intervals
	# Collect fragments on the opposite strand, downstream of poly-Ts (not sure if such reads really happen?)
	intervals = find_polys(fasta.seq, c="T", n=14)
	if len(intervals) > 0:
		polyTs[gene_id] = intervals

In [9]:
tr2g = {}
with open(d + "inputs/gencode.v31.primary_assembly.annotation.gtf") as f:
	for line in f:
		if "\ttranscript\t" in line:
			items = line.split("; ")
			chrom, _, _, start, end, _, strand, _, gid = items[0].split("\t")
			gene_id = gid.split('"')[1]
			transcript_id = items[1].split('"')[1]
			gene_type = items[2].split('"')[1]
			gene_name = items[3].split('"')[1]
			tr2g[transcript_id] = (chrom, start, end, strand, gene_id, gene_type, gene_name)

In [10]:
count = 0
with open(d + "fragments2genes.txt", "w") as ftr2g:
	with open(d + "inputs/gencode.v31.fragments.fa", "w") as fout:
		# Write the nascent fragments, with one partial transcript per internal poly-A/T site
		with open(d + "unspliced_fragments.txt", "w") as fucapture:
			for fasta in SeqIO.parse(open(d + "inputs/gencode.v31.unspliced.fa"),'fasta'):  # Note we're in the masked file now
				gene_id = fasta.id
				if gene_id in polyAs:
					for interval in polyAs[gene_id]:
						seq = str(fasta.seq[max(0, interval[0] - extent):interval[0]])
						#seq = seq.translate(tr).strip("N")
						if len(seq) >= min_len:
							count += 1
							transcript_id = f"{gene_id}.A{interval[0]}"
							trseq = SeqRecord(Seq(seq), transcript_id, '', '')
							fout.write(trseq.format("fasta"))
							ftr2g.write(f"{transcript_id}\t{gene_id}\n")
							fucapture.write(f"{transcript_id}\n")
				if gene_id in polyTs:
					for interval in polyTs[gene_id]:
						seq = str(fasta.seq[interval[1]:interval[1] + extent].reverse_complement())
						#seq = seq.translate(tr).strip("N")
						if len(seq) >= min_len:
							count += 1
							transcript_id = f"{gene_id}.T{interval[0]}"
							trseq = SeqRecord(Seq(seq), transcript_id, '', '')
							fout.write(trseq.format("fasta"))
							ftr2g.write(f"{transcript_id}\t{gene_id}\n")
							fucapture.write(f"{transcript_id}\n")
		# Write the mature fragments, covering the 3' end of each mature transcript
		with open(d + "spliced_fragments.txt", "w") as fscapture:
			for fasta in SeqIO.parse(open(d + "inputs/gencode.v31.transcripts.fa"),'fasta'):  # Note we're in the masked file now
				transcript_id = fasta.id.split("|")[0]
				gene_id = fasta.id.split("|")[1]
				attrs = tr2g[transcript_id]
				seq = str(fasta.seq[-extent:])
				if len(seq) >= min_len:
					count += 1
					trseq = SeqRecord(Seq(seq), f"{transcript_id}.{count} gene_id:{attrs[4]} gene_name:{attrs[6]}", '', '')
					fout.write(trseq.format("fasta"))
					ftr2g.write(f"{transcript_id}.{count}\t{attrs[4]}\n")
					fscapture.write(f"{transcript_id}.{count}\n")

Build the kallisto index

Run the following on the command line (it might take half an hour):

kallisto index -i gencode.v31.fragments.idx -k 31 inputs/gencode.v31.fragments.fa

You should now have the following directory structure:

10xv1_whitelist.txt
10xv2_whitelist.txt
10xv3_whitelist.txt
fragments2genes.txt
gencode.v31.fragments.idx
gencode.v31.metadata.tab
manifest.json
spliced_fragments.txt
unspliced_fragments.txt
inputs/

You can now remove the inputs/ subdirectory.

What just happened?

We created a composite kallisto index (gencode.v31.fragments.idx), with separate sequence fragments for spliced and unspliced transcripts. The IDs of spliced and unspliced fragments are listed in the spliced_fragments.txt and unspliced_fragments.txt files, so that we can later count them separately. For example, here's the first few lines of unspliced_fragments.txt (the last part of each ID, e.g. A14056, indicates that this fragment is located upstream of a poly-A stretch at position 14056):

ENSG00000277400.1.A14056
ENSG00000277400.1.A32841
ENSG00000277400.1.A35311
ENSG00000277400.1.A36796
ENSG00000277400.1.A44325
ENSG00000277400.1.A45592
ENSG00000277400.1.A47356
ENSG00000277400.1.A49571
ENSG00000277400.1.A53084
ENSG00000277400.1.A53231

We also created a mapping from fragments to genes, fragments2genes.txt so that we can later pool counts by gene. Here's the first few lines of that file:

ENSG00000277400.1.A14056    ENSG00000277400.1
ENSG00000277400.1.A32841    ENSG00000277400.1
ENSG00000277400.1.A35311    ENSG00000277400.1
ENSG00000277400.1.A36796    ENSG00000277400.1
ENSG00000277400.1.A44325    ENSG00000277400.1
ENSG00000277400.1.A45592    ENSG00000277400.1
ENSG00000277400.1.A47356    ENSG00000277400.1
ENSG00000277400.1.A49571    ENSG00000277400.1
ENSG00000277400.1.A53084    ENSG00000277400.1
ENSG00000277400.1.A53231    ENSG00000277400.1

Finally, we created a consolidated metadata file gencode.v31.metadata.tab which collects a lot of useful annotation about each gene:

Accession          # ENSEMBL accession
AccessionVersion   # ENSEMBL accession.version
Gene               # HGNC official gene symbol
FullName           # Gene long name
GeneType           # Like 'protein_coding', 'pseudogene', 'snRNA', 'rRNA', etc.
HgncID
Chromosome         # Like 'chr1', 'chr2', 'chrM', etc.
ChromosomeStart    # Integer start position of the gene
ChromosomeEnd      # Integer end position of the gene
LocusGroup
LocusType
Location           # Like '1p36.33'
LocationSortable   # Like '01p36.33'
Aliases
VegaID
UcscID
RefseqID
CcdsID
UniprotID
PubmedID
MgdID
RgdID
CosmicID
OmimID
MirBaseID
IsTF                # 'True' if the gene is a transcription factor
DnaBindingDomain    # Like 'HMG/Sox', 'Homeodomain', etc.

Of these, only Accession and Gene are required by Cytograph.


In [ ]: