Pipeline to delimit ICE

In this notebook, we'll find ICE and delimit them in the Haemophilus influenzae species

  1. First we'll get the complete genome from NCBI
  2. We'll build the core genome
  3. We'll detect the conjugative system in the genomes
  4. We'll identify the core-genes flanking the conjugative system (defining a spot)
  5. We'll build the pan-genome of the spot
  6. We'll make a visual representation of the spot with gene family frequency to delimit the ICE.

Approximate time ot run this notebook: ~15-20 minutes.

NB: The pipeline here works because the 3 ICEs we will found are in the same spot. But often, ICEs are found in different spots, and it implies to deal with that information to not mix everything.

Data

One needs to download the data, let's to it with python


In [2]:
mkdir Sequences

In [3]:
mkdir Sequences/Replicon

Download fasta sequences


In [4]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "me@my_institute.org" #tells ncbi who you are.

I download the sequence from the NC number of 10 Haemophilus influenzae


In [5]:
for nc in ["NC_000907", "NC_007146", "NC_009566", "NC_009567", "NC_014920", "NC_014922", "NC_016809", "NC_017452", "NC_017451", "NC_022356"]:
    handle = Entrez.esearch(db="nucleotide", term=nc, rettype="gb", retmode="text")
    record = Entrez.read(handle)
    if len(record["IdList"]) == 0: # if no hit in genbank
        print nc
    handle2 = Entrez.efetch(db="nucleotide", id=record["IdList"][0], rettype="fasta", retmode="text")
    replicons = SeqIO.read(handle2, "fasta")
    SeqIO.write(replicons,
                "Sequences/Replicon/" + nc +".fst",
                "fasta")

In [6]:
!wc -l Sequences/Replicon/*


   30504 Sequences/Replicon/NC_000907.fst
   31910 Sequences/Replicon/NC_007146.fst
   30219 Sequences/Replicon/NC_009566.fst
   31455 Sequences/Replicon/NC_009567.fst
   33099 Sequences/Replicon/NC_014920.fst
   33452 Sequences/Replicon/NC_014922.fst
   33027 Sequences/Replicon/NC_016809.fst
   32207 Sequences/Replicon/NC_017451.fst
   30324 Sequences/Replicon/NC_017452.fst
   30938 Sequences/Replicon/NC_022356.fst
  317135 total

Make CDS annotation with Prokka

Prokka

$ brew tap homebrew/science
$ brew update
$ brew install prokka --HEAD

In [7]:
mkdir Sequences/Annotations

In [8]:
%%bash
i=1
for nc in "NC_000907" "NC_007146" "NC_009566" "NC_009567" "NC_014920" "NC_014922" "NC_016809" "NC_017452" "NC_017451" "NC_022356"; do
    printf -v j "%03d" $i
    echo HAIN"$j"
    prokka --kingdom Bacteria \
           --genus Haemophilus Sequences/Replicon/"$nc".fst \
           --outdir Sequences/Annotations/HAIN"$j" \
           --locustag HAIN"$j" \
           --prefix HAIN"$j" \
           --cpus 0 >> LOG_prokka 2>&1
    i=$((i+1))
done


HAIN001
HAIN002
HAIN003
HAIN004
HAIN005
HAIN006
HAIN007
HAIN008
HAIN009
HAIN010

Build Core genomes

With Roary

$ brew tap homebrew/science
$ brew install bedtools cd-hit blast mcl parallel prank mafft fasttree cpanm
$ sudo cpanm -f Bio::Roary

Move all .gff files in the same folder so Roary can work with them


In [9]:
mkdir Sequences/GFF

In [10]:
!cp Sequences/Annotations/*/*gff Sequences/GFF/

In [11]:
mkdir Sequences/Pangenome

In [12]:
!roary -p 20 -f Sequences/Pangenome/ Sequences/GFF/*.gff > LOG_roary 2>&1

Get core-genes

BEWARE to change the random number generated by roary (here _1494846018)


In [39]:
!query_pan_genome -g Sequences/Pangenome/_1494846018/clustered_proteins -a intersection Sequences/GFF/*.gff

In [40]:
mv pan_genome_results core_genome.txt

In [41]:
!tail core_genome.txt


dsbD_1: HAIN001_00951	HAIN002_01010	HAIN003_01520	HAIN004_01607	HAIN005_01388	HAIN006_00251	HAIN007_01075	HAIN008_01456	HAIN009_01504	HAIN010_00085
trpE: HAIN001_01452	HAIN002_01652	HAIN003_00921	HAIN004_00138	HAIN005_00718	HAIN006_01575	HAIN007_01603	HAIN008_00877	HAIN009_01018	HAIN010_01126
proC: HAIN001_00336	HAIN002_00410	HAIN003_00297	HAIN004_00874	HAIN005_00256	HAIN006_00978	HAIN007_00452	HAIN008_00275	HAIN009_00263	HAIN010_01776
citC: HAIN001_00026	HAIN002_00026	HAIN003_00657	HAIN004_00514	HAIN005_00029	HAIN006_00031	HAIN007_00026	HAIN008_00627	HAIN009_00682	HAIN010_01513
rapZ: HAIN001_01212	HAIN002_01244	HAIN003_01268	HAIN004_01875	HAIN005_01115	HAIN006_01908	HAIN007_01349	HAIN008_01214	HAIN009_01261	HAIN010_01473
glpA: HAIN001_00741	HAIN002_00778	HAIN003_01753	HAIN004_01350	HAIN005_01672	HAIN006_00598	HAIN007_00865	HAIN008_01679	HAIN009_01804	HAIN010_00410
trkH: HAIN001_00778	HAIN002_00817	HAIN003_01704	HAIN004_01390	HAIN005_01639	HAIN006_00570	HAIN007_00906	HAIN008_01634	HAIN009_01765	HAIN010_00179
yjjW: HAIN001_00548	HAIN002_00615	HAIN003_00077	HAIN004_01171	HAIN005_01773	HAIN006_00696	HAIN007_00675	HAIN008_00066	HAIN009_00057	HAIN010_00341
mreC: HAIN001_00039	HAIN002_00039	HAIN003_00644	HAIN004_00529	HAIN005_00042	HAIN006_00044	HAIN007_00039	HAIN008_00614	HAIN009_00669	HAIN010_01526
malP: HAIN001_01424	HAIN002_01682	HAIN003_00891	HAIN004_00091	HAIN005_00691	HAIN006_01604	HAIN007_01577	HAIN008_00850	HAIN009_01045	HAIN010_01143

In [42]:
!wc -l core_genome.txt


    1082 core_genome.txt

Remove first columns


In [43]:
!awk -F":" '{print $2}' core_genome.txt > core_genome_2.txt

Identification of Conjugative systems

I downloaded the module CONJScan with model and profiles for conjugative systems to use with macsyfinder. I put it under a Conjugation folder and rename it "Models"


In [18]:
mkdir Conjugation


mkdir: Conjugation: File exists

In [19]:
ls Conjugation/


Models/

In [20]:
%%bash
for seq in Sequences/Annotations/*/*faa; do
    name=${seq##*/}
    for conj_type in typeF typeB typeC typeFATA typeFA typeG typeI typeT; do
        macsyfinder "$conj_type" \
                    -w 20 \
                    --db-type ordered_replicon \
                    -d Conjugation/Models/DEF \
                    -p Conjugation/Models/HMM \
                    --profile-suffix .HMM \
                    --sequence-db "$seq" \
                    -o Conjugation/${name%%.*}\_"$conj_type" >> /dev/null # stdout is already reported in output files from macsyfinder (macsyfinder.out)
    done
done

In [21]:
!wc Conjugation/*/macsyfinder.report


       0       0       0 Conjugation/HAIN002_typeB/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeC/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeF/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeFA/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeFATA/macsyfinder.report
      22     352    3057 Conjugation/HAIN002_typeG/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeI/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeT/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeB/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeC/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeF/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeFA/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeFATA/macsyfinder.report
      22     352    3056 Conjugation/HAIN007_typeG/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeI/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeT/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeB/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeC/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeF/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeFA/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeFATA/macsyfinder.report
      22     352    3059 Conjugation/HAIN009_typeG/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeI/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeT/macsyfinder.report
      66    1056    9172 total

We have 3 ICEs of type G in strain 2, 7 and 9


In [22]:
!awk 'NR==1||FNR>1' Conjugation/*/macsyfinder.report > Conj_results

Identify the flanking core genes: finding a spot


In [23]:
import pandas as pd
import numpy as np

In [24]:
conj = pd.read_table("Conj_results")

In [44]:
core = pd.read_table("core_genome_2.txt", header=None, names=["HAIN{:03d}".format(i) for i in range(1,11)])

Reformat Macsyfinder output, replace UserReplicon by the name of the replicon. (I may have missed an option)


In [45]:
conj["Replicon_name"] = conj["#Hit_Id"].apply(lambda x: x[:7])

In [46]:
conj.System_Id.replace("UserReplicon", conj.Replicon_name, regex=True, inplace=1)

In [47]:
conj.head()


Out[47]:
#Hit_Id Replicon_name Position Sequence_length Gene Reference_system Predicted_system System_Id System_status Gene_status i-evalue Score Profile_coverage Sequence_coverage Begin_match End_match
0 HAIN002_00111 HAIN002 107 209 G_tfc2 typeG typeG HAIN002_typeG_1 single_locus accessory 2.600000e-60 200.1 0.920792 0.990431 2 208
1 HAIN002_00112 HAIN002 108 246 G_tfc3 typeG typeG HAIN002_typeG_1 single_locus accessory 5.800000e-88 290.7 0.975000 0.971545 7 245
2 HAIN002_00114 HAIN002 110 168 G_tfc5 typeG typeG HAIN002_typeG_1 single_locus accessory 6.600000e-56 185.4 0.982558 0.994048 2 168
3 HAIN002_00115 HAIN002 111 749 t4cp2 CONJ typeG HAIN002_typeG_1 single_locus mandatory 6.000000e-35 117.3 0.990991 0.329773 404 650
4 HAIN002_00117 HAIN002 113 228 G_tfc7 typeG typeG HAIN002_typeG_1 single_locus accessory 8.400000e-85 281.1 0.971888 0.991228 3 228

To identify the flanking core genes, I sort the column in the table core corresponding to the genome where I have an ICE, and find the index where I would insert any genes of that ICE


In [48]:
flanking_core = {}
for ice in conj.groupby("System_Id"):
    data = ice[1]
    replicon = data.Replicon_name.unique()[0]
    core.sort_values(replicon, inplace=1)
    spots = np.unique([core[replicon].searchsorted(gene)[0] for gene in data["#Hit_Id"]])
    assert len(spots)==1 # make sure that all genes of the conjugative system insert at the same spot
    flanking_core[ice[0]] = core.iloc[spots[0]-1:spots[0]+1].index

In [50]:
core.sort_index(inplace=1) # reset the order of the table

In [51]:
flanking_core


Out[51]:
{'HAIN002_typeG_1': Int64Index([221, 783], dtype='int64'),
 'HAIN007_typeG_1': Int64Index([221, 783], dtype='int64'),
 'HAIN009_typeG_1': Int64Index([783, 221], dtype='int64')}

In [52]:
flanking_core_ID = core.loc[flanking_core.values()[0]]

In [56]:
flanking_core_ID


Out[56]:
HAIN001 HAIN002 HAIN003 HAIN004 HAIN005 HAIN006 HAIN007 HAIN008 HAIN009 HAIN010
221 HAIN001_00088 HAIN002_00089 HAIN003_00595 HAIN004_00583 HAIN005_00080 HAIN006_00083 HAIN007_00079 HAIN008_00566 HAIN009_00619 HAIN010_00854
783 HAIN001_00093 HAIN002_00162 HAIN003_00590 HAIN004_00589 HAIN005_00130 HAIN006_00088 HAIN007_00151 HAIN008_00561 HAIN009_00513 HAIN010_00849

We see that they are all in the same spot, inbetween core genes 221 and 783, and that strain number 9 has been sequenced on the complementary strand or has a synteny breakpoint at that position. Let's eliminate that latter possibility.

There is a breakpoint if the two flanking coregenes in the list of core genes are not continguous (in case of a inversion of one of the two coregenes with a third one for instance). Meaning that the difference between their rank is either 1, -1 or the number of coregenes (-1) (to wrap around).

Getting their rank


In [53]:
order = core.values.argsort(axis=0)
ranks = order.argsort(axis=0)

In [57]:
for k,v in flanking_core.iteritems():
    
    rank_diff = np.abs(np.diff(ranks[[v[0], v[1]]], axis=0)[0])
    bkp = np.where((rank_diff!=1) & (rank_diff!=len(core)-1))[0]
    if len(bkp)>0:
        end_sentence = "is breakpoint for strain {}".format(bkp)
    else:
        end_sentence = "is no breakpoint in any strain"
    print "The ICE {} is in the spot flanked by coregenes {} and {} and there {}".format(k, v[0], v[1], end_sentence)


The ICE HAIN007_typeG_1 is in the spot flanked by coregenes 221 and 783 and there is no breakpoint in any strain
The ICE HAIN009_typeG_1 is in the spot flanked by coregenes 783 and 221 and there is no breakpoint in any strain
The ICE HAIN002_typeG_1 is in the spot flanked by coregenes 221 and 783 and there is no breakpoint in any strain

Here there is no breakpoint in any the strains at that spot. In case there is a breakpoint, one should remove the corresponding strain of the analysis.

Let's get the information on the relative sequencing orientation so we can normalize the strand position of the genes. We fix the reference with respect to the strain 2 (arbitrary, the first strain with an ICE)


In [58]:
sequencing_orientation = {i:j for i,j in zip(core.columns, np.median(np.diff(ranks[ranks[:,1].argsort()], axis=0), axis=0))}

In [59]:
sequencing_orientation


Out[59]:
{'HAIN001': 1.0,
 'HAIN002': 1.0,
 'HAIN003': -1.0,
 'HAIN004': 1.0,
 'HAIN005': -1.0,
 'HAIN006': -1.0,
 'HAIN007': 1.0,
 'HAIN008': -1.0,
 'HAIN009': -1.0,
 'HAIN010': 1.0}

So the first two strain and the 7th and 10th are in the same median orientation, other strain will need to be flipped. We took the median orientation so it's not garanteed that it will work, as the ICE might be in a bigger region that have been inversed

Pan-Genome of the spot

We now have the 2 flanking core genes. We'll gather all the genes that are in the spot and cluster them with uclust to make gene families.


In [60]:
np.unique(flanking_core.values())


Out[60]:
array([221, 783])

In [61]:
good_core = core.loc[np.unique(flanking_core.values())]
good_core


Out[61]:
HAIN001 HAIN002 HAIN003 HAIN004 HAIN005 HAIN006 HAIN007 HAIN008 HAIN009 HAIN010
221 HAIN001_00088 HAIN002_00089 HAIN003_00595 HAIN004_00583 HAIN005_00080 HAIN006_00083 HAIN007_00079 HAIN008_00566 HAIN009_00619 HAIN010_00854
783 HAIN001_00093 HAIN002_00162 HAIN003_00590 HAIN004_00589 HAIN005_00130 HAIN006_00088 HAIN007_00151 HAIN008_00561 HAIN009_00513 HAIN010_00849

Make list of gene for each interval


In [62]:
mkdir Spots

In [63]:
for strain in good_core.columns:
    i = [v.strip() for v in good_core[strain].values]
    f,l = int(i[0][-5:]), int(i[1][-5:])
    s = SeqIO.index("Sequences/Annotations/{strain}/{strain}.faa".format(strain=strain), "fasta")
    seqs = [s.get("{}{:05d}".format(i[0][:8], j)) for j in range(min(f,l), max(f,l)+1)]
    SeqIO.write([seq for seq in seqs if seq is not None],
                "Spots/{}_spot.faa".format(strain),
                "fasta")

Here we only have 1 spot, but one should keep track of the spots


In [64]:
!grep -c ">" Spots/*faa


Spots/HAIN001_spot.faa:2
Spots/HAIN002_spot.faa:70
Spots/HAIN003_spot.faa:2
Spots/HAIN004_spot.faa:3
Spots/HAIN005_spot.faa:46
Spots/HAIN006_spot.faa:2
Spots/HAIN007_spot.faa:69
Spots/HAIN008_spot.faa:2
Spots/HAIN009_spot.faa:102
Spots/HAIN010_spot.faa:2

In [65]:
!cat Spots/*faa > Spots/All_prot_spot.prt

In [66]:
mkdir Spots/Clusters

In [67]:
!usearch -quiet -cluster_fast Spots/All_prot_spot.prt \
         -id 0.7\
         -centroids Spots/All_prot_spot-70.fasta \
         -uc Spots/All_prot_spot-70.uc > Spots/log_usearch

In [68]:
cluster = pd.read_table("Spots/All_prot_spot-70.uc", header=None)

In [69]:
families = cluster[cluster[0]!="C"].groupby(1)[8].apply(lambda x: sorted(x)).values

In [70]:
fam = pd.DataFrame([sorted(f) for f in families])

In [72]:
fam.sort_values(by=fam.columns.tolist(), inplace=1)
fam.sort_values(by=fam.index.tolist(), axis=1, inplace=1)

Format data to plot

We use and old script I wrote which is quite ugly to read, but works. You can find it here: spot_ICE

The advantage is that it produces an interactive plot ot the spot, but the drawback is that the input format is a bit weird. Let's get through


In [74]:
from Bio import SeqUtils

The input file of the script looks like:

#header
\n
ICE_ID
genome1&2 genome2&2 genome3&50 genome4&2 #list of genome ID with the number of gene in between the 2 core genes (included)
<Table with genes from the genomes with the ICE in columns, see below for description>
0.4 0.45 0.43 .... #GC% values along the spot.
\n

Where the table has the following columns:

Column description Value Comment
1 Gene_id String
2 Annotation String
3 Start Integer
4 End Integer
5 Gene orientation 1/- 1
6 Replicon orientation 1/-1 compare to the other genome of the species, is this replicon sequenced on the same strand or not ? (same average orientation of the core genes)
7 Normalized orientation (Gene orientation * Replicon orientation)
8 Gene size Integer
9 MPF profile MOBH/G_tfc23/... if the gene is matched by macsyfinder, which profile matched ?
10 Integrase_type Tyr_rec if the gene is matched by tyrosine or serine recombinase profiles
11 Gene families HAIN002_00160&HAIN009_00555 genes in other genome of the species that are homologous.
12 RNA tRNA&tRNA-Lys&143658&143734&1 RFAM hits, in the line of the closest gene, and formatted as: RNA_type&RFAM_number&start&end&strand

In any case, if there is not value, it must be a 0


In [75]:
with open("All_data_hain.txt", "w") as outf:
    outf.write("#Annotation\tgene_name\tStart\tEnd\tStrand\tStrand_replicon\tStrand_norm\tSize\tGene\tIntegrase\tgene_fam\tRNA\n\n") # Header optional, could be an empty line

    for sys_ID in conj.System_Id.unique():
        df_sys_ID = pd.DataFrame(columns=fam.columns[:-1])
        strain = sys_ID.split("_")[0]
        for f in fam.iterrows():
            try:
                idx = f[1][f[1].str.contains(strain, na=False)].index
                df_sys_ID.loc[f[1][idx[0]]] = f[1][f[1].index.isin(idx)==0].values
            except IndexError:
                pass
        df_sys_ID.sort_index(inplace=1)
        # spot_size is like: ['HAIN001&2', 'HAIN002&70', 'HAIN003&2', 'HAIN004&3', 'HAIN005&46', ...]
        spot_size = ["&".join((m, str(n))) for m,n in pd.Series([i.split("_")[0].strip() for i in fam.values.ravel() if isinstance(i, str)]).value_counts().sort_index().iteritems()]

        gbk = SeqIO.read("Sequences/Annotations/{strain}/{strain}.gbk".format(strain=strain), "genbank")

        final = pd.DataFrame(columns=["Annotation", "gene_name", "Start", "End", "Strand"])
        for feat in gbk.features[1:]:
            if min(flanking_core_ID[strain]) <= feat.qualifiers["locus_tag"][0] <= max(flanking_core_ID[strain]):
                if "gene" in feat.qualifiers:
                    gene_name = feat.qualifiers["gene"][0]
                elif feat.type=="tRNA":
                    gene_name = feat.qualifiers["note"][0]
                else:
                    gene_name = "HP"
                final.loc[feat.qualifiers["locus_tag"][0]] = feat.qualifiers["product"][0], gene_name, feat.location.start, feat.location.end, feat.location.strand

        final["Strand_replicon"] = sequencing_orientation[strain]
        final["Strand_norm"] = final.Strand*final.Strand_replicon
        final["Size"] = final.End - final.Start + 1
        final = final.merge(conj.set_index("#Hit_Id")[["Gene"]], left_index=True, right_index=True, how="left")
        final["Integrase"] = np.nan # If you detected integrases you can put something else than Nan or 0 in front of the corresponding gene.
        gene_fams = df_sys_ID.apply(lambda x: "&".join(x.dropna()), axis=1).replace("", "0")
        # gene_fams is like:
        # HAIN002_00089    HAIN001_00088&HAIN003_00595&HAIN004_00583&HAIN...
        # HAIN002_00093                          HAIN007_00083&HAIN009_00615
        # HAIN002_00094                          HAIN007_00084&HAIN009_00614
        # HAIN002_00095                          HAIN007_00085&HAIN009_00613
        # HAIN002_00096                          HAIN007_00086&HAIN009_00612
        final = final.merge(gene_fams.to_frame(name="gene_fam"), left_index=True, right_index=True, how='left')
        final["RNA"] = final.apply(lambda x: "&".join(("tRNA",
                                                       x["Annotation"],
                                                       str(int(x["Start"])),
                                                       str(int(x["End"])),
                                                       str(int(x["Strand_norm"])))) if "tRNA" in x["Annotation"] else np.nan, axis=1)

        final.fillna(0, inplace=1)
        final[final.select_dtypes([float]).columns] = final.select_dtypes([float]).astype(int, inplace=1) 
        GC = [SeqUtils.GC(gbk.seq[final.Start[0]:final.End[-1]][i:i+1000]) for i in range(0, final.End[-1]-final.Start[0]+1, 1000)]
        outf.write(sys_ID+"\n")
        outf.write(" ".join(spot_size)+"\n")
        final.to_csv(outf, header=None, sep="\t")
        outf.write(" ".join(map(str, GC))+"\n\n")
print "Done"


Done

Plot the spot

Once you downloaded the script provided earlier, just type (in a terminal preferably, and not in a interactive python environnenment (note the ! at the beginning of the line, call a shell)

The window can be clicked:

  • left click on gene: print information for that genes
  • left click on any x-axis: erase all gene informations
  • right click on 2 genes in the same ICE: select the gene as a first and last genes delimiting the ICE
  • shift + right-click: remove the selected genes in the corresponding ICE.
    • Beware to click 2 times in the same ICE otherwise the limits will be mixed between ICEs (in case of odd number of right-clicks).
    • If you click 4 times in the same ICE, it will define 2 ICEs in tandem. (Or any even number of right click will define half as many ICE)
  • The spot are plotted 4 at a times, and the window close every 4. You can increase this number in the script if you have a big enough screen.
  • You can export an ICE in its spot that you'd like by ticking the box on the right (weird rectangles) and click export. It will create a file Selection... that you can re-use. You could easily merge 2 exported files from different window.

In [77]:
!../ICE_plot/plot_ICE_spot.py All_data_hain.txt

Here is a screenshot, obtained with this entire pipeline:


In [78]:
ls


All_data_hain.txt         LOG_roary                 Spots/
Conj_results              Limits_All_data_hain.txt  core_genome.txt
Conjugation/              OLD/                      core_genome_2.txt
Delimit_ICE.ipynb         Screenshot_spot_ICE.png
LOG_prokka                Sequences/

In [79]:
!head Limits_All_data_hain.txt


ICE_ID	first_gene	last_gene
HAIN002_typeG_1	HAIN002_00093	HAIN002_00160
HAIN007_typeG_1	HAIN007_00083	HAIN007_00149
HAIN009_typeG_1	HAIN009_00515	HAIN009_00615

In [ ]: