In this notebook, we'll find ICEs (Integrative and Conjugative Elements) and delimit them in the Haemophilus influenzae species.
Approximate time to run this notebook: ~15-20 minutes.
NB: The pipeline here works because the 3 ICEs we will find are in the same spot. But ICEs are often found in different spots, in which case one needs to keep track of which ICE belongs to which spot to avoid mixing everything up.
First, we need to download the data; let's do it with Python.
In [2]:
mkdir Sequences
In [3]:
mkdir Sequences/Replicon
In [4]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "me@my_institute.org" #tells ncbi who you are.
I download the sequences from the NC accession numbers of 10 Haemophilus influenzae strains.
In [5]:
for nc in ["NC_000907", "NC_007146", "NC_009566", "NC_009567", "NC_014920", "NC_014922", "NC_016809", "NC_017452", "NC_017451", "NC_022356"]:
    handle = Entrez.esearch(db="nucleotide", term=nc, rettype="gb", retmode="text")
    record = Entrez.read(handle)
    if len(record["IdList"]) == 0:  # no hit in GenBank: report the accession and skip it
        print(nc)
        continue
    handle2 = Entrez.efetch(db="nucleotide", id=record["IdList"][0], rettype="fasta", retmode="text")
    replicons = SeqIO.read(handle2, "fasta")
    SeqIO.write(replicons,
                "Sequences/Replicon/" + nc + ".fst",
                "fasta")
In [6]:
!wc -l Sequences/Replicon/*
To install Prokka (here via Homebrew):
$ brew tap homebrew/science
$ brew update
$ brew install prokka --HEAD
In [7]:
mkdir Sequences/Annotations
In [8]:
%%bash
i=1
for nc in "NC_000907" "NC_007146" "NC_009566" "NC_009567" "NC_014920" "NC_014922" "NC_016809" "NC_017452" "NC_017451" "NC_022356"; do
    printf -v j "%03d" $i
    echo HAIN"$j"
    prokka --kingdom Bacteria \
           --genus Haemophilus \
           --outdir Sequences/Annotations/HAIN"$j" \
           --locustag HAIN"$j" \
           --prefix HAIN"$j" \
           --cpus 0 \
           Sequences/Replicon/"$nc".fst >> LOG_prokka 2>&1
    i=$((i+1))
done
Compute the pan-genome with Roary. First, install it and its dependencies:
$ brew tap homebrew/science
$ brew install bedtools cd-hit blast mcl parallel prank mafft fasttree cpanm
$ sudo cpanm -f Bio::Roary
Copy all the .gff files into the same folder so Roary can work on them.
In [9]:
mkdir Sequences/GFF
In [10]:
!cp Sequences/Annotations/*/*gff Sequences/GFF/
In [11]:
mkdir Sequences/Pangenome
In [12]:
!roary -p 20 -f Sequences/Pangenome/ Sequences/GFF/*.gff > LOG_roary 2>&1
Get the core genes
BEWARE: you must change the random number Roary appended to its output directory (here _1494846018) to match your own run.
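To avoid hard-coding that number, a small helper can pick the Roary output directory automatically. This is a sketch, not part of Roary: `find_roary_outdir` and the `_*` glob pattern are assumptions based on the layout above.

```python
import glob
import os

def find_roary_outdir(base):
    """Return the newest Roary output directory under `base`.

    Roary suffixes its output directory with the epoch time at launch
    (e.g. _1494846018), so among equal-length names the lexicographically
    largest one is the most recent.  Hypothetical helper, not part of Roary.
    """
    candidates = glob.glob(os.path.join(base, "_*"))
    if not candidates:
        raise FileNotFoundError("no Roary output directory under " + base)
    return max(candidates, key=os.path.basename)
```

One could then build the path as `os.path.join(find_roary_outdir("Sequences/Pangenome"), "clustered_proteins")` instead of editing the cell below by hand.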
In [39]:
!query_pan_genome -g Sequences/Pangenome/_1494846018/clustered_proteins -a intersection Sequences/GFF/*.gff
In [40]:
mv pan_genome_results core_genome.txt
In [41]:
!tail core_genome.txt
In [42]:
!wc -l core_genome.txt
Remove the first column (keep only what follows the colon)
In [43]:
!awk -F":" '{print $2}' core_genome.txt > core_genome_2.txt
I downloaded the CONJScan module, with the models and profiles for conjugative systems, to use with MacSyFinder. I put it under a Conjugation folder and renamed it "Models".
In [18]:
mkdir Conjugation
In [19]:
ls Conjugation/
In [20]:
%%bash
for seq in Sequences/Annotations/*/*faa; do
    name=${seq##*/}
    for conj_type in typeF typeB typeC typeFATA typeFA typeG typeI typeT; do
        macsyfinder "$conj_type" \
            -w 20 \
            --db-type ordered_replicon \
            -d Conjugation/Models/DEF \
            -p Conjugation/Models/HMM \
            --profile-suffix .HMM \
            --sequence-db "$seq" \
            -o Conjugation/"${name%%.*}"_"$conj_type" >> /dev/null  # stdout is already reported in macsyfinder's own output files (macsyfinder.out)
    done
done
In [21]:
!wc Conjugation/*/macsyfinder.report
We have 3 ICEs of type G, in strains 2, 7 and 9.
In [22]:
!awk 'NR==1||FNR>1' Conjugation/*/macsyfinder.report > Conj_results
In [23]:
import pandas as pd
import numpy as np
In [24]:
conj = pd.read_table("Conj_results")
In [44]:
core = pd.read_table("core_genome_2.txt", header=None, names=["HAIN{:03d}".format(i) for i in range(1,11)])
Reformat the MacSyFinder output: replace UserReplicon with the name of the replicon. (I may have missed an option that does this directly.)
In [45]:
conj["Replicon_name"] = conj["#Hit_Id"].apply(lambda x: x[:7])
In [46]:
conj.System_Id.replace("UserReplicon", conj.Replicon_name, regex=True, inplace=True)
In [47]:
conj.head()
Out[47]:
To identify the flanking core genes, I sort the column of the core table corresponding to the genome carrying an ICE, and find the index at which any gene of that ICE would be inserted.
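The idea can be checked on a toy example: with the core genes of one genome sorted lexicographically, `searchsorted` gives the insertion index of each ICE gene, and a single unique index means the whole system sits in one inter-core-gene spot (the locus tags below are made up, not real data):

```python
import numpy as np

# sorted core genes of one (toy) genome
core_sorted = np.array(["HAIN002_00100", "HAIN002_00200", "HAIN002_00300"])
# genes of a (toy) conjugative system
ice_genes = ["HAIN002_00150", "HAIN002_00160"]

# insertion index of each ICE gene in the sorted core-gene list
spots = np.unique([core_sorted.searchsorted(g) for g in ice_genes])
# a single value: every ICE gene inserts between core genes 00100 and 00200
```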
In [48]:
flanking_core = {}
for ice_id, data in conj.groupby("System_Id"):
    replicon = data.Replicon_name.unique()[0]
    core.sort_values(replicon, inplace=True)
    spots = np.unique([core[replicon].searchsorted(gene)[0] for gene in data["#Hit_Id"]])
    assert len(spots) == 1  # make sure that all genes of the conjugative system insert at the same spot
    flanking_core[ice_id] = core.iloc[spots[0]-1:spots[0]+1].index
In [50]:
core.sort_index(inplace=True)  # reset the order of the table
In [51]:
flanking_core
Out[51]:
In [52]:
flanking_core_ID = core.loc[list(flanking_core.values())[0]]
In [56]:
flanking_core_ID
Out[56]:
We see that they are all in the same spot, in between core genes 221 and 783, and that strain number 9 either has been sequenced on the complementary strand or has a synteny breakpoint at that position. Let's eliminate the latter possibility.
There is a breakpoint if the two flanking core genes are not contiguous in the list of core genes (for instance after an inversion involving one of the two core genes and a third one). Contiguity means that the difference between their ranks is 1, -1, or plus or minus the number of core genes minus 1 (to wrap around the circular chromosome).
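On toy ranks the test looks like this (`is_contiguous` is a hypothetical helper mirroring the check done below):

```python
import numpy as np

def is_contiguous(rank_pair, n_core):
    """True when two core genes are adjacent in the (circular) core-gene order."""
    d = abs(int(np.diff(rank_pair)[0]))
    return d == 1 or d == n_core - 1

# with 10 core genes:
# ranks 3 and 4 are adjacent, 0 and 9 are adjacent by wrap-around, 2 and 6 are not
```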
Getting their rank
In [53]:
order = core.values.argsort(axis=0)
ranks = order.argsort(axis=0)
In [57]:
for k, v in flanking_core.items():
    rank_diff = np.abs(np.diff(ranks[[v[0], v[1]]], axis=0)[0])
    bkp = np.where((rank_diff != 1) & (rank_diff != len(core)-1))[0]
    if len(bkp) > 0:
        end_sentence = "is a breakpoint for strain {}".format(bkp)
    else:
        end_sentence = "is no breakpoint in any strain"
    print("The ICE {} is in the spot flanked by core genes {} and {} and there {}".format(k, v[0], v[1], end_sentence))
Here there is no breakpoint in any of the strains at that spot. If there were a breakpoint, one should remove the corresponding strain from the analysis.
Let's get the information on the relative sequencing orientation so we can normalize the strand position of the genes. We fix the reference with respect to strain 2 (arbitrarily, the first strain with an ICE).
In [58]:
sequencing_orientation = {i:j for i,j in zip(core.columns, np.median(np.diff(ranks[ranks[:,1].argsort()], axis=0), axis=0))}
In [59]:
sequencing_orientation
Out[59]:
So the first two strains, as well as the 7th and the 10th, are in the same median orientation; the other strains will need to be flipped. Since we took the median orientation, this is not guaranteed to work: the ICE might sit in a larger region that has been inverted.
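A toy example shows why the median of rank differences gives the orientation: after sorting the rows by the reference column, each other column increases by +1 per step if that strain is sequenced the same way, and by -1 if it is flipped. Here the reference is column 0 (the cell above uses column 1, i.e. strain 2):

```python
import numpy as np

ref = np.arange(5)                       # core-gene ranks in the reference strain
ranks = np.column_stack([ref,            # reference strain
                         ref,            # same sequencing orientation
                         ref[::-1]])     # sequenced on the complementary strand

# median step of each column once rows are sorted by the reference column
orientation = np.median(np.diff(ranks[ranks[:, 0].argsort()], axis=0), axis=0)
# -> [1., 1., -1.]: the third strain must be flipped
```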
We now have the 2 flanking core genes. We'll gather all the genes that are in the spot and cluster them with uclust to make gene families.
In [60]:
np.unique(flanking_core.values())
Out[60]:
In [61]:
good_core = core.loc[np.unique(flanking_core.values())]
good_core
Out[61]:
Make the list of genes for each interval
In [62]:
mkdir Spots
In [63]:
for strain in good_core.columns:
    i = [v.strip() for v in good_core[strain].values]
    f, l = int(i[0][-5:]), int(i[1][-5:])
    s = SeqIO.index("Sequences/Annotations/{strain}/{strain}.faa".format(strain=strain), "fasta")
    seqs = [s.get("{}{:05d}".format(i[0][:8], j)) for j in range(min(f, l), max(f, l)+1)]
    SeqIO.write([seq for seq in seqs if seq is not None],
                "Spots/{}_spot.faa".format(strain),
                "fasta")
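The locus-tag arithmetic in the cell above can be isolated in a small helper (a sketch assuming Prokka's PREFIX_NNNNN numbering; `locus_range` is hypothetical):

```python
def locus_range(first, last):
    """All locus tags between two flanking tags, inclusive, in either order."""
    prefix = first[:first.rindex("_") + 1]   # e.g. "HAIN002_"
    a, b = int(first[-5:]), int(last[-5:])
    return ["{}{:05d}".format(prefix, j) for j in range(min(a, b), max(a, b) + 1)]
```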
Here we only have 1 spot, but in general one should keep track of which spot each gene list comes from.
In [64]:
!grep -c ">" Spots/*faa
In [65]:
!cat Spots/*faa > Spots/All_prot_spot.prt
In [66]:
mkdir Spots/Clusters
In [67]:
!usearch -quiet -cluster_fast Spots/All_prot_spot.prt \
         -id 0.7 \
         -centroids Spots/All_prot_spot-70.fasta \
         -uc Spots/All_prot_spot-70.uc > Spots/log_usearch
In [68]:
cluster = pd.read_table("Spots/All_prot_spot-70.uc", header=None)
In [69]:
families = cluster[cluster[0]!="C"].groupby(1)[8].apply(lambda x: sorted(x)).values
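What this does can be seen on a minimal fake `.uc` table. In the USEARCH `.uc` format, column 0 is the record type (`S` seed, `H` hit, `C` cluster summary), column 1 the cluster number, and column 8 the query label; `C` lines duplicate the seeds and are dropped. The records below are made up:

```python
import pandas as pd

rows = [  # toy records mimicking a .uc file
    ["S", 0, 300, "*", "*", "*", "*", "*", "HAIN002_00093", "*"],
    ["H", 0, 300, 98.0, "+", "*", "*", "=", "HAIN007_00083", "HAIN002_00093"],
    ["C", 0, 2, "*", "*", "*", "*", "*", "HAIN002_00093", "*"],
]
toy = pd.DataFrame(rows)

# group the S and H records by cluster number, collect the member labels
toy_families = toy[toy[0] != "C"].groupby(1)[8].apply(lambda x: sorted(x)).values
# -> one family containing HAIN002_00093 and HAIN007_00083
```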
In [70]:
fam = pd.DataFrame([sorted(f) for f in families])
In [72]:
fam.sort_values(by=fam.columns.tolist(), inplace=True)
fam.sort_values(by=fam.index.tolist(), axis=1, inplace=True)
We use an old script I wrote, which is quite ugly to read but works. You can find it here: spot_ICE
The advantage is that it produces an interactive plot of the spot, but the drawback is that its input format is a bit weird. Let's go through it.
In [74]:
from Bio import SeqUtils
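The last line of each block of the input file holds GC% values along the spot. `SeqUtils.GC` computes the GC% of one sequence; windowing it is straightforward. A self-contained sketch of the same computation in pure Python (`gc_windows` is a hypothetical helper, not part of Biopython):

```python
def gc_windows(seq, window=1000):
    """GC% of consecutive windows along a sequence string (last window may be shorter)."""
    return [100.0 * sum(chunk.count(b) for b in "GCgc") / len(chunk)
            for chunk in (seq[i:i + window] for i in range(0, len(seq), window))]
```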
The input file of the script looks like:
#header
\n
ICE_ID
genome1&2 genome2&2 genome3&50 genome4&2  # list of genome IDs with the number of genes between the 2 core genes (inclusive)
<Table with genes from the genomes with the ICE in columns, see below for description>
0.4 0.45 0.43 .... #GC% values along the spot.
\n
Where the table has the following columns:
| Column | description | Value | Comment |
|---|---|---|---|
| 1 | Gene_id | String | |
| 2 | Annotation | String | |
| 3 | Start | Integer | |
| 4 | End | Integer | |
| 5 | Gene orientation | 1/-1 | |
| 6 | Replicon orientation | 1/-1 | compared to the other genomes of the species, is this replicon sequenced on the same strand or not? (same average orientation of the core genes) |
| 7 | Normalized orientation | (Gene orientation * Replicon orientation) | |
| 8 | Gene size | Integer | |
| 9 | MPF profile | MOBH/G_tfc23/... | if the gene is matched by macsyfinder, which profile matched ? |
| 10 | Integrase_type | Tyr_rec | if the gene is matched by tyrosine or serine recombinase profiles |
| 11 | Gene families | HAIN002_00160&HAIN009_00555 | homologous genes in the other genomes of the species |
| 12 | RNA | tRNA&tRNA-Lys&143658&143734&1 | RFAM hits, in the line of the closest gene, and formatted as: RNA_type&RFAM_number&start&end&strand |
In any column, if there is no value, it must be a 0.
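To make the format concrete, here is a toy block written to a string buffer; every value is made up for illustration (the real file is produced by the cell below):

```python
import io

buf = io.StringIO()
buf.write("#header\n\n")                          # optional header line, then a blank line
buf.write("HAIN002_G\n")                          # ICE_ID (toy)
buf.write("HAIN001&2 HAIN002&5\n")                # genome&gene-count pairs (toy)
gene_line = "\t".join([
    "HAIN002_00093", "hypothetical protein",      # 1-2: Gene_id, Annotation
    "100", "400", "1", "1", "1", "301",           # 3-8: Start, End, orientations, Size
    "G_tfc23", "0",                               # 9-10: MPF profile, Integrase_type
    "HAIN007_00083&HAIN009_00615", "0",           # 11-12: gene families, RNA
])
buf.write(gene_line + "\n")                       # one such line per gene in the spot
buf.write("0.40 0.45 0.43\n\n")                   # GC% values along the spot
```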
In [75]:
with open("All_data_hain.txt", "w") as outf:
    outf.write("#Annotation\tgene_name\tStart\tEnd\tStrand\tStrand_replicon\tStrand_norm\tSize\tGene\tIntegrase\tgene_fam\tRNA\n\n")  # header optional, could be an empty line
    for sys_ID in conj.System_Id.unique():
        df_sys_ID = pd.DataFrame(columns=fam.columns[:-1])
        strain = sys_ID.split("_")[0]
        for f in fam.iterrows():
            try:
                idx = f[1][f[1].str.contains(strain, na=False)].index
                df_sys_ID.loc[f[1][idx[0]]] = f[1][~f[1].index.isin(idx)].values
            except IndexError:
                pass
        df_sys_ID.sort_index(inplace=True)
        # spot_size is like: ['HAIN001&2', 'HAIN002&70', 'HAIN003&2', 'HAIN004&3', 'HAIN005&46', ...]
        spot_size = ["&".join((m, str(n))) for m, n in pd.Series([i.split("_")[0].strip() for i in fam.values.ravel() if isinstance(i, str)]).value_counts().sort_index().items()]
        gbk = SeqIO.read("Sequences/Annotations/{strain}/{strain}.gbk".format(strain=strain), "genbank")
        final = pd.DataFrame(columns=["Annotation", "gene_name", "Start", "End", "Strand"])
        for feat in gbk.features[1:]:
            if min(flanking_core_ID[strain]) <= feat.qualifiers["locus_tag"][0] <= max(flanking_core_ID[strain]):
                if "gene" in feat.qualifiers:
                    gene_name = feat.qualifiers["gene"][0]
                elif feat.type == "tRNA":
                    gene_name = feat.qualifiers["note"][0]
                else:
                    gene_name = "HP"
                final.loc[feat.qualifiers["locus_tag"][0]] = feat.qualifiers["product"][0], gene_name, feat.location.start, feat.location.end, feat.location.strand
        final["Strand_replicon"] = sequencing_orientation[strain]
        final["Strand_norm"] = final.Strand * final.Strand_replicon
        final["Size"] = final.End - final.Start + 1
        final = final.merge(conj.set_index("#Hit_Id")[["Gene"]], left_index=True, right_index=True, how="left")
        final["Integrase"] = np.nan  # if you detected integrases you can put something else than NaN or 0 in front of the corresponding gene
        gene_fams = df_sys_ID.apply(lambda x: "&".join(x.dropna()), axis=1).replace("", "0")
        # gene_fams is like:
        # HAIN002_00089    HAIN001_00088&HAIN003_00595&HAIN004_00583&HAIN...
        # HAIN002_00093    HAIN007_00083&HAIN009_00615
        # HAIN002_00094    HAIN007_00084&HAIN009_00614
        # HAIN002_00095    HAIN007_00085&HAIN009_00613
        # HAIN002_00096    HAIN007_00086&HAIN009_00612
        final = final.merge(gene_fams.to_frame(name="gene_fam"), left_index=True, right_index=True, how="left")
        final["RNA"] = final.apply(lambda x: "&".join(("tRNA",
                                                       x["Annotation"],
                                                       str(int(x["Start"])),
                                                       str(int(x["End"])),
                                                       str(int(x["Strand_norm"])))) if "tRNA" in x["Annotation"] else np.nan, axis=1)
        final.fillna(0, inplace=True)
        final[final.select_dtypes([float]).columns] = final.select_dtypes([float]).astype(int)  # astype has no inplace argument
        GC = [SeqUtils.GC(gbk.seq[final.Start[0]:final.End[-1]][i:i+1000]) for i in range(0, final.End[-1]-final.Start[0]+1, 1000)]
        outf.write(sys_ID + "\n")
        outf.write(" ".join(spot_size) + "\n")
        final.to_csv(outf, header=None, sep="\t")
        outf.write(" ".join(map(str, GC)) + "\n\n")
print("Done")
Once you have downloaded the script provided earlier, just type the following (preferably in a terminal rather than in an interactive Python environment; note the ! at the beginning of the line, which calls a shell):
The window can be clicked:
The selection is exported to a file that you can re-use. You could easily merge two exported files from different windows.
In [77]:
!../ICE_plot/plot_ICE_spot.py All_data_hain.txt
Here is a screenshot, obtained with this entire pipeline:
In [78]:
ls
In [79]:
!head Limits_All_data_hain.txt
In [ ]: