TIR Project

Search for bacterial TIR domain-containing proteins.

1) Set the working directory.



In [2]:

    
cd ~/Data/tir_project

2) Download the assembly summary from NCBI FTP.



In [ ]:

    
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

Column number	Column name	Example
1	assembly_accession	GCA_000174395.2
2	bioproject	PRJNA30627
3	biosample	SAMN00002237
4	wgs_master
5	refseq_category	reference genome
6	taxid	333849
7	species_taxid	1352
8	organism_name	Enterococcus faecium DO
9	infraspecific_name	strain=DO
10	isolate
11	version_status	latest
12	assembly_level	Complete Genome
13	release_type	Major
14	genome_rep	Full
15	seq_rel_date	2012/05/25
16	asm_name	ASM17439v2
17	submitter	Baylor College of Medicine
18	gbrs_paired_asm	GCF_000174395.2
19	paired_asm_comp	identical
20	ftp_path	ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2
21	excluded_from_refseq
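
The column numbers above are what the awk commands in later steps refer to. As a quick way to inspect a few of them, here is a minimal sketch. The function name `show_assemblies` is made up for illustration, and it assumes the file is tab-separated with header lines that start with "#".

```bash
# Hypothetical helper: print accession, organism name and assembly level
# (columns 1, 8 and 12) for every record, skipping "#" header lines.
show_assemblies() {
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $1, $8, $12 }' "$@"
}

# Example: peek at the first few records of the downloaded summary.
# show_assemblies assembly_summary_genbank.txt | head -n 5
```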

3) Download categories.dmp from NCBI, which links each taxon ID to its top-level category (e.g. Bacteria).



In [ ]:

    
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat.tar.gz
tar -xvzf taxcat.tar.gz
rm taxcat.tar.gz

categories.dmp contains a single line for each node that is at or below the species level in the NCBI Taxonomy database.

The first column is the top-level category:

A = Archaea
B = Bacteria
E = Eukaryota
V = Viruses and Viroids
U = Unclassified
O = Other

The third column is the taxid itself, and the second column is the corresponding species-level taxid.

These nodes in the taxonomy:

242703 - Acidilobus saccharovorans
666510 - Acidilobus saccharovorans 345-15

will appear in categories.dmp as:

A 242703 242703
A 242703 666510
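
Given this layout, bacterial assemblies can be selected by matching the taxid in column 6 of the assembly summary against the category "B" taxids in column 3 of categories.dmp. The sketch below shows one possible way to do that. It is not the contents of extract_bacteria_assemblies.sh (which are not shown in this notebook), and the function name `filter_bacteria` is made up for illustration. It assumes the assembly summary is tab-separated and that categories.dmp is separated by tabs or spaces.

```bash
# Hypothetical helper: keep only assembly records whose taxid (column 6)
# is listed under category "B" in categories.dmp.
# Usage: filter_bacteria categories.dmp assembly_summary_genbank.txt
filter_bacteria() {
    local categories=$1 summary=$2
    awk -F'\t' '
        # First file: remember every taxid (third column) in category B.
        NR == FNR {
            split($0, f, /[[:space:]]+/)
            if (f[1] == "B") bacteria[f[3]] = 1
            next
        }
        # Second file: skip "#" header lines, print records with a bacterial taxid.
        /^#/ { next }
        ($6 in bacteria)
    ' "$categories" "$summary"
}
```

Matching on column 6 (taxid) rather than column 7 (species_taxid) works because categories.dmp lists every node at or below the species level.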



In [7]:

    
wc -l assembly_summary_genbank.txt

   92276 assembly_summary_genbank.txt

4) Extract bacterial assembly records and create a link to the protein file.

The FTP link for the assembly is:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2

We need to generate the FTP link for the protein file:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCA_000174395.2_ASM17439v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCF_000174395.2_ASM17439v2_protein.faa.gz

The link to the protein file is built as: <column 20>/<column 1>_<column 16>_protein.faa.gz
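
A minimal sketch of how such a link could be appended to each record is shown below. It is not the actual extract_bacteria_assemblies.sh, and the function name `add_protein_link` is made up for illustration. It takes the file name prefix from the last path component of column 20 instead of joining columns 1 and 16, on the assumption that the two are equivalent except where asm_name contains characters that are altered in the path. It also assumes that records without an FTP path carry an empty value or "na" in column 20.

```bash
# Hypothetical helper: append the protein FASTA URL as an extra last column.
# The file name reuses the last path component of ftp_path (column 20).
add_protein_link() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { next }                      # skip header lines
        $20 == "" || $20 == "na" { next }  # no FTP path recorded
        {
            n = split($20, parts, "/")     # parts[n] is e.g. GCA_000174395.2_ASM17439v2
            print $0, $20 "/" parts[n] "_protein.faa.gz"
        }' "$@"
}

# Example: add_protein_link assembly_summary_genbank.txt | cut -f22 | head -n 3
```

For a 21-column record the link lands in column 22, which is the column the download steps below read.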



In [38]:

    
PATH=$PATH:~/Data/Notebooks/tir_project
extract_bacteria_assemblies.sh

5) Summarize genomes by RefSeq category (column 5) and assembly level (column 12).



In [14]:

    
awk 'BEGIN{FS="\t"} {gen[$5"\t"$12]++} END{for (x in gen) {print x"\t"gen[x]}}' bacteria_only_assembly_summary_genbank.txt | sort

na	Chromosome	1044
na	Complete Genome	4711
na	Contig	36884
na	Scaffold	38489
reference genome	Chromosome	2
reference genome	Complete Genome	118
representative genome	Chromosome	110
representative genome	Complete Genome	1468
representative genome	Contig	1761
representative genome	Scaffold	1741

6) Get protein sequences from bacterial assemblies that are reference genomes (column 5). The protein file link is read from column 22.



In [ ]:

    
awk ' BEGIN{FS="\t"}($5 == "reference genome"){print $22}' bacteria_only_assembly_summary_genbank.txt | xargs -L 1 wget --quiet -P ~/Data/tir_project/reference_genomes

7) Get protein sequences from bacterial assemblies that are representative genomes (column 5). The protein file links (column 22) are first collected into an array.



In [3]:

    
rg=($(awk ' BEGIN{FS="\t"}($5 == "representative genome"){print $22}' bacteria_only_assembly_summary_genbank.txt))



In [ ]:

    
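x=0      # running count of links processed
err=0    # count of links that do not point to a protein file
outdir=~/Data/tir_project/representative_genomes
for i in "${rg[@]}"
do
    ((x++))
    echo "$x: $i"
    var="$outdir/${i##*/}"    # expected local file name

    if [[ $i != *.faa.gz ]]
    then
        # Not a protein file link: log it and move on.
        echo "$i" >> ~/Data/tir_project/representative_genomes_ftp.err
        ((err++))
    else
        # Download only if the file is not already present.
        if [ ! -e "$var" ]
        then
            wget -P "$outdir" "$i"
        fi
        # If the file is still missing, retry with GCA replaced by GCF
        # (the RefSeq accession prefix) throughout the link.
        # A file fetched this way is saved under its GCF name.
        if [ ! -e "$var" ]
        then
            j=${i//GCA/GCF}
            wget -P "$outdir" "$j"
        fi
    fi
done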
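
After a long download loop like this, it can be useful to confirm that the archives are intact before using them. The following is a minimal sketch of such a check. It is an addition for illustration, not part of the original workflow, and the function name `list_corrupt_archives` is made up. It relies only on `gzip -t`, which tests the integrity of a compressed file.

```bash
# Hypothetical helper: list .faa.gz files in a directory that fail gzip's
# integrity test (for example, truncated downloads).
list_corrupt_archives() {
    local dir=$1 f
    for f in "$dir"/*.faa.gz
    do
        [ -e "$f" ] || continue          # directory had no matching files
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Example: list_corrupt_archives ~/Data/tir_project/representative_genomes
```

Any file it prints could be deleted and fetched again by re-running the loop above, since the loop skips files that already exist.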