Pipeline to delimit ICE

In this notebook, we'll find ICE and delimit them in the Haemophilus influenzae species

First we'll get the complete genome from NCBI
We'll build the core genome
We'll detect the conjugative system in the genomes
We'll identify the core-genes flanking the conjugative system (defining a spot)
We'll build the pan-genome of the spot
We'll make a visual representation of the spot with gene family frequency to delimit the ICE.

2 Data

2.1 Download fasta sequences

2.2 Make CDS annotation with Prokka

3 Build Core genomes

4 Identification of Conjugative systems

5 Identify the flanking core genes: finding a spot

6 Pan-Genome of the spot

7 Format data to plot

7.1 Make a table for each ICE

8 Plot the spot

Approximate time ot run this notebook: ~15-20 minutes.

NB: The pipeline here works because the 3 ICEs we will found are in the same spot. But often, ICEs are found in different spots, and it implies to deal with that information to not mix everything.

Data

One needs to download the data, let's to it with python



In [2]:

    
mkdir Sequences



In [3]:

    
mkdir Sequences/Replicon

Download fasta sequences



In [4]:

    
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "me@my_institute.org" #tells ncbi who you are.

I download the sequence from the NC number of 10 Haemophilus influenzae



In [5]:

    
for nc in ["NC_000907", "NC_007146", "NC_009566", "NC_009567", "NC_014920", "NC_014922", "NC_016809", "NC_017452", "NC_017451", "NC_022356"]:
    handle = Entrez.esearch(db="nucleotide", term=nc, rettype="gb", retmode="text")
    record = Entrez.read(handle)
    if len(record["IdList"]) == 0: # if no hit in genbank
        print nc
    handle2 = Entrez.efetch(db="nucleotide", id=record["IdList"][0], rettype="fasta", retmode="text")
    replicons = SeqIO.read(handle2, "fasta")
    SeqIO.write(replicons,
                "Sequences/Replicon/" + nc +".fst",
                "fasta")



In [6]:

    
!wc -l Sequences/Replicon/*









    



   30504 Sequences/Replicon/NC_000907.fst
   31910 Sequences/Replicon/NC_007146.fst
   30219 Sequences/Replicon/NC_009566.fst
   31455 Sequences/Replicon/NC_009567.fst
   33099 Sequences/Replicon/NC_014920.fst
   33452 Sequences/Replicon/NC_014922.fst
   33027 Sequences/Replicon/NC_016809.fst
   32207 Sequences/Replicon/NC_017451.fst
   30324 Sequences/Replicon/NC_017452.fst
   30938 Sequences/Replicon/NC_022356.fst
  317135 total

Make CDS annotation with Prokka

Prokka

$ brew tap homebrew/science
$ brew update
$ brew install prokka --HEAD



In [7]:

    
mkdir Sequences/Annotations



In [8]:

    
%%bash
i=1
for nc in "NC_000907" "NC_007146" "NC_009566" "NC_009567" "NC_014920" "NC_014922" "NC_016809" "NC_017452" "NC_017451" "NC_022356"; do
    printf -v j "%03d" $i
    echo HAIN"$j"
    prokka --kingdom Bacteria \
           --genus Haemophilus Sequences/Replicon/"$nc".fst \
           --outdir Sequences/Annotations/HAIN"$j" \
           --locustag HAIN"$j" \
           --prefix HAIN"$j" \
           --cpus 0 >> LOG_prokka 2>&1
    i=$((i+1))
done









    



HAIN001
HAIN002
HAIN003
HAIN004
HAIN005
HAIN006
HAIN007
HAIN008
HAIN009
HAIN010

Build Core genomes

With Roary

$ brew tap homebrew/science
$ brew install bedtools cd-hit blast mcl parallel prank mafft fasttree cpanm
$ sudo cpanm -f Bio::Roary

Move all .gff files in the same folder so Roary can work with them



In [9]:

    
mkdir Sequences/GFF



In [10]:

    
!cp Sequences/Annotations/*/*gff Sequences/GFF/



In [11]:

    
mkdir Sequences/Pangenome



In [12]:

    
!roary -p 20 -f Sequences/Pangenome/ Sequences/GFF/*.gff > LOG_roary 2>&1

Get core-genes

BEWARE to change the random number generated by roary (here _1494846018)



In [39]:

    
!query_pan_genome -g Sequences/Pangenome/_1494846018/clustered_proteins -a intersection Sequences/GFF/*.gff



In [40]:

    
mv pan_genome_results core_genome.txt



In [41]:

    
!tail core_genome.txt









    



dsbD_1: HAIN001_00951	HAIN002_01010	HAIN003_01520	HAIN004_01607	HAIN005_01388	HAIN006_00251	HAIN007_01075	HAIN008_01456	HAIN009_01504	HAIN010_00085
trpE: HAIN001_01452	HAIN002_01652	HAIN003_00921	HAIN004_00138	HAIN005_00718	HAIN006_01575	HAIN007_01603	HAIN008_00877	HAIN009_01018	HAIN010_01126
proC: HAIN001_00336	HAIN002_00410	HAIN003_00297	HAIN004_00874	HAIN005_00256	HAIN006_00978	HAIN007_00452	HAIN008_00275	HAIN009_00263	HAIN010_01776
citC: HAIN001_00026	HAIN002_00026	HAIN003_00657	HAIN004_00514	HAIN005_00029	HAIN006_00031	HAIN007_00026	HAIN008_00627	HAIN009_00682	HAIN010_01513
rapZ: HAIN001_01212	HAIN002_01244	HAIN003_01268	HAIN004_01875	HAIN005_01115	HAIN006_01908	HAIN007_01349	HAIN008_01214	HAIN009_01261	HAIN010_01473
glpA: HAIN001_00741	HAIN002_00778	HAIN003_01753	HAIN004_01350	HAIN005_01672	HAIN006_00598	HAIN007_00865	HAIN008_01679	HAIN009_01804	HAIN010_00410
trkH: HAIN001_00778	HAIN002_00817	HAIN003_01704	HAIN004_01390	HAIN005_01639	HAIN006_00570	HAIN007_00906	HAIN008_01634	HAIN009_01765	HAIN010_00179
yjjW: HAIN001_00548	HAIN002_00615	HAIN003_00077	HAIN004_01171	HAIN005_01773	HAIN006_00696	HAIN007_00675	HAIN008_00066	HAIN009_00057	HAIN010_00341
mreC: HAIN001_00039	HAIN002_00039	HAIN003_00644	HAIN004_00529	HAIN005_00042	HAIN006_00044	HAIN007_00039	HAIN008_00614	HAIN009_00669	HAIN010_01526
malP: HAIN001_01424	HAIN002_01682	HAIN003_00891	HAIN004_00091	HAIN005_00691	HAIN006_01604	HAIN007_01577	HAIN008_00850	HAIN009_01045	HAIN010_01143



In [42]:

    
!wc -l core_genome.txt









    



    1082 core_genome.txt

Remove first columns



In [43]:

    
!awk -F":" '{print $2}' core_genome.txt > core_genome_2.txt

Identification of Conjugative systems

I downloaded the module CONJScan with model and profiles for conjugative systems to use with macsyfinder. I put it under a Conjugation folder and rename it "Models"



In [18]:

    
mkdir Conjugation









    



mkdir: Conjugation: File exists



In [19]:

    
ls Conjugation/









    



Models/



In [20]:

    
%%bash
for seq in Sequences/Annotations/*/*faa; do
    name=${seq##*/}
    for conj_type in typeF typeB typeC typeFATA typeFA typeG typeI typeT; do
        macsyfinder "$conj_type" \
                    -w 20 \
                    --db-type ordered_replicon \
                    -d Conjugation/Models/DEF \
                    -p Conjugation/Models/HMM \
                    --profile-suffix .HMM \
                    --sequence-db "$seq" \
                    -o Conjugation/${name%%.*}\_"$conj_type" >> /dev/null # stdout is already reported in output files from macsyfinder (macsyfinder.out)
    done
done



In [21]:

    
!wc Conjugation/*/macsyfinder.report









    



       0       0       0 Conjugation/HAIN002_typeB/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeC/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeF/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeFA/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeFATA/macsyfinder.report
      22     352    3057 Conjugation/HAIN002_typeG/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeI/macsyfinder.report
       0       0       0 Conjugation/HAIN002_typeT/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeB/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeC/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeF/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeFA/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeFATA/macsyfinder.report
      22     352    3056 Conjugation/HAIN007_typeG/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeI/macsyfinder.report
       0       0       0 Conjugation/HAIN007_typeT/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeB/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeC/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeF/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeFA/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeFATA/macsyfinder.report
      22     352    3059 Conjugation/HAIN009_typeG/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeI/macsyfinder.report
       0       0       0 Conjugation/HAIN009_typeT/macsyfinder.report
      66    1056    9172 total

We have 3 ICEs of type G in strain 2, 7 and 9



In [22]:

    
!awk 'NR==1||FNR>1' Conjugation/*/macsyfinder.report > Conj_results

Identify the flanking core genes: finding a spot



In [23]:

    
import pandas as pd
import numpy as np



In [24]:

    
conj = pd.read_table("Conj_results")



In [44]:

    
core = pd.read_table("core_genome_2.txt", header=None, names=["HAIN{:03d}".format(i) for i in range(1,11)])

Reformat Macsyfinder output, replace UserReplicon by the name of the replicon. (I may have missed an option)



In [45]:

    
conj["Replicon_name"] = conj["#Hit_Id"].apply(lambda x: x[:7])



In [46]:

    
conj.System_Id.replace("UserReplicon", conj.Replicon_name, regex=True, inplace=1)



In [47]:

    
conj.head()









    Out[47]:






  
    
      
      #Hit_Id
      Replicon_name
      Position
      Sequence_length
      Gene
      Reference_system
      Predicted_system
      System_Id
      System_status
      Gene_status
      i-evalue
      Score
      Profile_coverage
      Sequence_coverage
      Begin_match
      End_match
    
  
  
    
      0
      HAIN002_00111
      HAIN002
      107
      209
      G_tfc2
      typeG
      typeG
      HAIN002_typeG_1
      single_locus
      accessory
      2.600000e-60
      200.1
      0.920792
      0.990431
      2
      208
    
    
      1
      HAIN002_00112
      HAIN002
      108
      246
      G_tfc3
      typeG
      typeG
      HAIN002_typeG_1
      single_locus
      accessory
      5.800000e-88
      290.7
      0.975000
      0.971545
      7
      245
    
    
      2
      HAIN002_00114
      HAIN002
      110
      168
      G_tfc5
      typeG
      typeG
      HAIN002_typeG_1
      single_locus
      accessory
      6.600000e-56
      185.4
      0.982558
      0.994048
      2
      168
    
    
      3
      HAIN002_00115
      HAIN002
      111
      749
      t4cp2
      CONJ
      typeG
      HAIN002_typeG_1
      single_locus
      mandatory
      6.000000e-35
      117.3
      0.990991
      0.329773
      404
      650
    
    
      4
      HAIN002_00117
      HAIN002
      113
      228
      G_tfc7
      typeG
      typeG
      HAIN002_typeG_1
      single_locus
      accessory
      8.400000e-85
      281.1
      0.971888
      0.991228
      3
      228

To identify the flanking core genes, I sort the column in the table core corresponding to the genome where I have an ICE, and find the index where I would insert any genes of that ICE



In [48]:

    
flanking_core = {}
for ice in conj.groupby("System_Id"):
    data = ice[1]
    replicon = data.Replicon_name.unique()[0]
    core.sort_values(replicon, inplace=1)
    spots = np.unique([core[replicon].searchsorted(gene)[0] for gene in data["#Hit_Id"]])
    assert len(spots)==1 # make sure that all genes of the conjugative system insert at the same spot
    flanking_core[ice[0]] = core.iloc[spots[0]-1:spots[0]+1].index



In [50]:

    
core.sort_index(inplace=1) # reset the order of the table



In [51]:

    
flanking_core









    Out[51]:





{'HAIN002_typeG_1': Int64Index([221, 783], dtype='int64'),
 'HAIN007_typeG_1': Int64Index([221, 783], dtype='int64'),
 'HAIN009_typeG_1': Int64Index([783, 221], dtype='int64')}



In [52]:

    
flanking_core_ID = core.loc[flanking_core.values()[0]]



In [56]:

    
flanking_core_ID









    Out[56]:






  
    
      
      HAIN001
      HAIN002
      HAIN003
      HAIN004
      HAIN005
      HAIN006
      HAIN007
      HAIN008
      HAIN009
      HAIN010
    
  
  
    
      221
      HAIN001_00088
      HAIN002_00089
      HAIN003_00595
      HAIN004_00583
      HAIN005_00080
      HAIN006_00083
      HAIN007_00079
      HAIN008_00566
      HAIN009_00619
      HAIN010_00854
    
    
      783
      HAIN001_00093
      HAIN002_00162
      HAIN003_00590
      HAIN004_00589
      HAIN005_00130
      HAIN006_00088
      HAIN007_00151
      HAIN008_00561
      HAIN009_00513
      HAIN010_00849

We see that they are all in the same spot, inbetween core genes 221 and 783, and that strain number 9 has been sequenced on the complementary strand or has a synteny breakpoint at that position. Let's eliminate that latter possibility.

There is a breakpoint if the two flanking coregenes in the list of core genes are not continguous (in case of a inversion of one of the two coregenes with a third one for instance). Meaning that the difference between their rank is either 1, -1 or the number of coregenes (-1) (to wrap around).

Getting their rank



In [53]:

    
order = core.values.argsort(axis=0)
ranks = order.argsort(axis=0)



In [57]:

    
for k,v in flanking_core.iteritems():
    
    rank_diff = np.abs(np.diff(ranks[[v[0], v[1]]], axis=0)[0])
    bkp = np.where((rank_diff!=1) & (rank_diff!=len(core)-1))[0]
    if len(bkp)>0:
        end_sentence = "is breakpoint for strain {}".format(bkp)
    else:
        end_sentence = "is no breakpoint in any strain"
    print "The ICE {} is in the spot flanked by coregenes {} and {} and there {}".format(k, v[0], v[1], end_sentence)









    



The ICE HAIN007_typeG_1 is in the spot flanked by coregenes 221 and 783 and there is no breakpoint in any strain
The ICE HAIN009_typeG_1 is in the spot flanked by coregenes 783 and 221 and there is no breakpoint in any strain
The ICE HAIN002_typeG_1 is in the spot flanked by coregenes 221 and 783 and there is no breakpoint in any strain

Here there is no breakpoint in any the strains at that spot. In case there is a breakpoint, one should remove the corresponding strain of the analysis.

Let's get the information on the relative sequencing orientation so we can normalize the strand position of the genes. We fix the reference with respect to the strain 2 (arbitrary, the first strain with an ICE)



In [58]:

    
sequencing_orientation = {i:j for i,j in zip(core.columns, np.median(np.diff(ranks[ranks[:,1].argsort()], axis=0), axis=0))}



In [59]:

    
sequencing_orientation









    Out[59]:





{'HAIN001': 1.0,
 'HAIN002': 1.0,
 'HAIN003': -1.0,
 'HAIN004': 1.0,
 'HAIN005': -1.0,
 'HAIN006': -1.0,
 'HAIN007': 1.0,
 'HAIN008': -1.0,
 'HAIN009': -1.0,
 'HAIN010': 1.0}

So the first two strain and the 7th and 10th are in the same median orientation, other strain will need to be flipped. We took the median orientation so it's not garanteed that it will work, as the ICE might be in a bigger region that have been inversed

Pan-Genome of the spot

We now have the 2 flanking core genes. We'll gather all the genes that are in the spot and cluster them with uclust to make gene families.



In [60]:

    
np.unique(flanking_core.values())









    Out[60]:





array([221, 783])



In [61]:

    
good_core = core.loc[np.unique(flanking_core.values())]
good_core









    Out[61]:






  
    
      
      HAIN001
      HAIN002
      HAIN003
      HAIN004
      HAIN005
      HAIN006
      HAIN007
      HAIN008
      HAIN009
      HAIN010
    
  
  
    
      221
      HAIN001_00088
      HAIN002_00089
      HAIN003_00595
      HAIN004_00583
      HAIN005_00080
      HAIN006_00083
      HAIN007_00079
      HAIN008_00566
      HAIN009_00619
      HAIN010_00854
    
    
      783
      HAIN001_00093
      HAIN002_00162
      HAIN003_00590
      HAIN004_00589
      HAIN005_00130
      HAIN006_00088
      HAIN007_00151
      HAIN008_00561
      HAIN009_00513
      HAIN010_00849

Make list of gene for each interval



In [62]:

    
mkdir Spots



In [63]:

    
for strain in good_core.columns:
    i = [v.strip() for v in good_core[strain].values]
    f,l = int(i[0][-5:]), int(i[1][-5:])
    s = SeqIO.index("Sequences/Annotations/{strain}/{strain}.faa".format(strain=strain), "fasta")
    seqs = [s.get("{}{:05d}".format(i[0][:8], j)) for j in range(min(f,l), max(f,l)+1)]
    SeqIO.write([seq for seq in seqs if seq is not None],
                "Spots/{}_spot.faa".format(strain),
                "fasta")

Here we only have 1 spot, but one should keep track of the spots



In [64]:

    
!grep -c ">" Spots/*faa









    



Spots/HAIN001_spot.faa:2
Spots/HAIN002_spot.faa:70
Spots/HAIN003_spot.faa:2
Spots/HAIN004_spot.faa:3
Spots/HAIN005_spot.faa:46
Spots/HAIN006_spot.faa:2
Spots/HAIN007_spot.faa:69
Spots/HAIN008_spot.faa:2
Spots/HAIN009_spot.faa:102
Spots/HAIN010_spot.faa:2



In [65]:

    
!cat Spots/*faa > Spots/All_prot_spot.prt



In [66]:

    
mkdir Spots/Clusters



In [67]:

    
!usearch -quiet -cluster_fast Spots/All_prot_spot.prt \
         -id 0.7\
         -centroids Spots/All_prot_spot-70.fasta \
         -uc Spots/All_prot_spot-70.uc > Spots/log_usearch



In [68]:

    
cluster = pd.read_table("Spots/All_prot_spot-70.uc", header=None)



In [69]:

    
families = cluster[cluster[0]!="C"].groupby(1)[8].apply(lambda x: sorted(x)).values



In [70]:

    
fam = pd.DataFrame([sorted(f) for f in families])



In [72]:

    
fam.sort_values(by=fam.columns.tolist(), inplace=1)
fam.sort_values(by=fam.index.tolist(), axis=1, inplace=1)

Format data to plot

We use and old script I wrote which is quite ugly to read, but works. You can find it here: spot_ICE

The advantage is that it produces an interactive plot ot the spot, but the drawback is that the input format is a bit weird. Let's get through



In [74]:

    
from Bio import SeqUtils

The input file of the script looks like:

#header
\n
ICE_ID
genome1&2 genome2&2 genome3&50 genome4&2 #list of genome ID with the number of gene in between the 2 core genes (included)
<Table with genes from the genomes with the ICE in columns, see below for description>
0.4 0.45 0.43 .... #GC% values along the spot.
\n

Where the table has the following columns:

Column	description	Value	Comment
1	Gene_id	String
2	Annotation	String
3	Start	Integer
4	End	Integer
5	Gene orientation	1/- 1
6	Replicon orientation	1/-1	compare to the other genome of the species, is this replicon sequenced on the same strand or not ? (same average orientation of the core genes)
7	Normalized orientation	(Gene orientation * Replicon orientation)
8	Gene size	Integer
9	MPF profile	MOBH/G_tfc23/...	if the gene is matched by macsyfinder, which profile matched ?
10	Integrase_type	Tyr_rec	if the gene is matched by tyrosine or serine recombinase profiles
11	Gene families	HAIN002_00160&HAIN009_00555	genes in other genome of the species that are homologous.
12	RNA	tRNA&tRNA-Lys&143658&143734&1	RFAM hits, in the line of the closest gene, and formatted as: RNA_type&RFAM_number&start&end&strand

In any case, if there is not value, it must be a 0



In [75]:

    
with open("All_data_hain.txt", "w") as outf:
    outf.write("#Annotation\tgene_name\tStart\tEnd\tStrand\tStrand_replicon\tStrand_norm\tSize\tGene\tIntegrase\tgene_fam\tRNA\n\n") # Header optional, could be an empty line

    for sys_ID in conj.System_Id.unique():
        df_sys_ID = pd.DataFrame(columns=fam.columns[:-1])
        strain = sys_ID.split("_")[0]
        for f in fam.iterrows():
            try:
                idx = f[1][f[1].str.contains(strain, na=False)].index
                df_sys_ID.loc[f[1][idx[0]]] = f[1][f[1].index.isin(idx)==0].values
            except IndexError:
                pass
        df_sys_ID.sort_index(inplace=1)
        # spot_size is like: ['HAIN001&2', 'HAIN002&70', 'HAIN003&2', 'HAIN004&3', 'HAIN005&46', ...]
        spot_size = ["&".join((m, str(n))) for m,n in pd.Series([i.split("_")[0].strip() for i in fam.values.ravel() if isinstance(i, str)]).value_counts().sort_index().iteritems()]

        gbk = SeqIO.read("Sequences/Annotations/{strain}/{strain}.gbk".format(strain=strain), "genbank")

        final = pd.DataFrame(columns=["Annotation", "gene_name", "Start", "End", "Strand"])
        for feat in gbk.features[1:]:
            if min(flanking_core_ID[strain]) <= feat.qualifiers["locus_tag"][0] <= max(flanking_core_ID[strain]):
                if "gene" in feat.qualifiers:
                    gene_name = feat.qualifiers["gene"][0]
                elif feat.type=="tRNA":
                    gene_name = feat.qualifiers["note"][0]
                else:
                    gene_name = "HP"
                final.loc[feat.qualifiers["locus_tag"][0]] = feat.qualifiers["product"][0], gene_name, feat.location.start, feat.location.end, feat.location.strand

        final["Strand_replicon"] = sequencing_orientation[strain]
        final["Strand_norm"] = final.Strand*final.Strand_replicon
        final["Size"] = final.End - final.Start + 1
        final = final.merge(conj.set_index("#Hit_Id")[["Gene"]], left_index=True, right_index=True, how="left")
        final["Integrase"] = np.nan # If you detected integrases you can put something else than Nan or 0 in front of the corresponding gene.
        gene_fams = df_sys_ID.apply(lambda x: "&".join(x.dropna()), axis=1).replace("", "0")
        # gene_fams is like:
        # HAIN002_00089    HAIN001_00088&HAIN003_00595&HAIN004_00583&HAIN...
        # HAIN002_00093                          HAIN007_00083&HAIN009_00615
        # HAIN002_00094                          HAIN007_00084&HAIN009_00614
        # HAIN002_00095                          HAIN007_00085&HAIN009_00613
        # HAIN002_00096                          HAIN007_00086&HAIN009_00612
        final = final.merge(gene_fams.to_frame(name="gene_fam"), left_index=True, right_index=True, how='left')
        final["RNA"] = final.apply(lambda x: "&".join(("tRNA",
                                                       x["Annotation"],
                                                       str(int(x["Start"])),
                                                       str(int(x["End"])),
                                                       str(int(x["Strand_norm"])))) if "tRNA" in x["Annotation"] else np.nan, axis=1)

        final.fillna(0, inplace=1)
        final[final.select_dtypes([float]).columns] = final.select_dtypes([float]).astype(int, inplace=1) 
        GC = [SeqUtils.GC(gbk.seq[final.Start[0]:final.End[-1]][i:i+1000]) for i in range(0, final.End[-1]-final.Start[0]+1, 1000)]
        outf.write(sys_ID+"\n")
        outf.write(" ".join(spot_size)+"\n")
        final.to_csv(outf, header=None, sep="\t")
        outf.write(" ".join(map(str, GC))+"\n\n")
print "Done"









    



Done

Plot the spot

Once you downloaded the script provided earlier, just type (in a terminal preferably, and not in a interactive python environnenment (note the ! at the beginning of the line, call a shell)

The window can be clicked:

left click on gene: print information for that genes
left click on any x-axis: erase all gene informations
right click on 2 genes in the same ICE: select the gene as a first and last genes delimiting the ICE
shift + right-click: remove the selected genes in the corresponding ICE.
- Beware to click 2 times in the same ICE otherwise the limits will be mixed between ICEs (in case of odd number of right-clicks).
- If you click 4 times in the same ICE, it will define 2 ICEs in tandem. (Or any even number of right click will define half as many ICE)
The spot are plotted 4 at a times, and the window close every 4. You can increase this number in the script if you have a big enough screen.
You can export an ICE in its spot that you'd like by ticking the box on the right (weird rectangles) and click export. It will create a file Selection... that you can re-use. You could easily merge 2 exported files from different window.



In [77]:

    
!../ICE_plot/plot_ICE_spot.py All_data_hain.txt

Here is a screenshot, obtained with this entire pipeline:



In [78]:

    
ls









    



All_data_hain.txt         LOG_roary                 Spots/
Conj_results              Limits_All_data_hain.txt  core_genome.txt
Conjugation/              OLD/                      core_genome_2.txt
Delimit_ICE.ipynb         Screenshot_spot_ICE.png
LOG_prokka                Sequences/



In [79]:

    
!head Limits_All_data_hain.txt









    



ICE_ID	first_gene	last_gene
HAIN002_typeG_1	HAIN002_00093	HAIN002_00160
HAIN007_typeG_1	HAIN007_00083	HAIN007_00149
HAIN009_typeG_1	HAIN009_00515	HAIN009_00615



In [ ]:

	#Hit_Id	Replicon_name	Position	Sequence_length	Gene	Reference_system	Predicted_system	System_Id	System_status	Gene_status	i-evalue	Score	Profile_coverage	Sequence_coverage	Begin_match	End_match
0	HAIN002_00111	HAIN002	107	209	G_tfc2	typeG	typeG	HAIN002_typeG_1	single_locus	accessory	2.600000e-60	200.1	0.920792	0.990431	2	208
1	HAIN002_00112	HAIN002	108	246	G_tfc3	typeG	typeG	HAIN002_typeG_1	single_locus	accessory	5.800000e-88	290.7	0.975000	0.971545	7	245
2	HAIN002_00114	HAIN002	110	168	G_tfc5	typeG	typeG	HAIN002_typeG_1	single_locus	accessory	6.600000e-56	185.4	0.982558	0.994048	2	168
3	HAIN002_00115	HAIN002	111	749	t4cp2	CONJ	typeG	HAIN002_typeG_1	single_locus	mandatory	6.000000e-35	117.3	0.990991	0.329773	404	650
4	HAIN002_00117	HAIN002	113	228	G_tfc7	typeG	typeG	HAIN002_typeG_1	single_locus	accessory	8.400000e-85	281.1	0.971888	0.991228	3	228

	HAIN001	HAIN002	HAIN003	HAIN004	HAIN005	HAIN006	HAIN007	HAIN008	HAIN009	HAIN010
221	HAIN001_00088	HAIN002_00089	HAIN003_00595	HAIN004_00583	HAIN005_00080	HAIN006_00083	HAIN007_00079	HAIN008_00566	HAIN009_00619	HAIN010_00854
783	HAIN001_00093	HAIN002_00162	HAIN003_00590	HAIN004_00589	HAIN005_00130	HAIN006_00088	HAIN007_00151	HAIN008_00561	HAIN009_00513	HAIN010_00849