1) Set the working directory.
In [2]:
cd ~/Data/tir_project
2) Download the assembly summary from NCBI FTP.
In [ ]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
The file is tab-delimited and has the following columns:

Column number | Column name | Example |
---|---|---|
1 | assembly_accession | GCA_000174395.2 |
2 | bioproject | PRJNA30627 |
3 | biosample | SAMN00002237 |
4 | wgs_master | |
5 | refseq_category | reference genome |
6 | taxid | 333849 |
7 | species_taxid | 1352 |
8 | organism_name | Enterococcus faecium DO |
9 | infraspecific_name | strain=DO |
10 | isolate | |
11 | version_status | latest |
12 | assembly_level | Complete Genome |
13 | release_type | Major |
14 | genome_rep | Full |
15 | seq_rel_date | 2012/05/25 |
16 | asm_name | ASM17439v2 |
17 | submitter | Baylor College of Medicine |
18 | gbrs_paired_asm | GCF_000174395.2 |
19 | paired_asm_comp | identical |
20 | ftp_path | ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2 |
21 | excluded_from_refseq | |
3) Download categories.dmp from NCBI, which links each taxon ID to its top-level category (e.g. Bacteria).
In [ ]:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat.tar.gz
tar -xvzf taxcat.tar.gz
rm taxcat.tar.gz
categories.dmp contains a single line for each node that is at or below the species level in the NCBI Taxonomy database.
The first column is the top-level category:
A = Archaea
B = Bacteria
E = Eukaryota
V = Viruses and Viroids
U = Unclassified
O = Other
The third column is the taxid itself, and the second column is the corresponding species-level taxid.
For example, these nodes in the taxonomy:
242703 - Acidilobus saccharovorans
666510 - Acidilobus saccharovorans 345-15
will appear in categories.dmp as:
A 242703 242703
A 242703 666510
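As a minimal sketch of how this file can be used, the function below lists the taxid (third column) of every node in the Bacteria category. The function name `bacterial_taxids` and the output file name in the usage comment are illustrative and not part of the original workflow.

```shell
# Print the taxid (column 3) of every node whose top-level category is "B".
# categories.dmp columns: category, species-level taxid, taxid.
# awk's default field splitting handles both tabs and spaces.
bacterial_taxids() {
    awk '$1 == "B" { print $3 }' "$1"
}

# Usage (hypothetical output file name):
#   bacterial_taxids categories.dmp > bacteria_taxids.txt
```

The resulting list can then be matched against column 6 (taxid) of the assembly summary.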
In [7]:
wc -l assembly_summary_genbank.txt
4) Extract bacterial assembly records and create link to protein file
The FTP link for the assembly is:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2
The FTP link for the protein file has to be generated from it:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCA_000174395.2_ASM17439v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCF_000174395.2_ASM17439v2_protein.faa.gz
The link to the protein file is composed of column 20, "/", column 1, "_", column 16, and "_protein.faa.gz".
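The composition rule can be sketched as a small helper plus an awk filter. The function names `protein_url` and `append_protein_url` are hypothetical. The sketch assumes tab-delimited input with the 21 columns listed earlier and skips header lines that start with `#`.

```shell
# Build the protein file link from three fields:
#   $1 = ftp_path (column 20)
#   $2 = assembly_accession (column 1)
#   $3 = asm_name (column 16)
protein_url() {
    printf '%s/%s_%s_protein.faa.gz\n' "$1" "$2" "$3"
}

# Read assembly summary records on stdin and write them to stdout
# with the protein file link appended as an extra last column.
append_protein_url() {
    awk 'BEGIN { FS = OFS = "\t" }
         !/^#/ { print $0, $20 "/" $1 "_" $16 "_protein.faa.gz" }'
}

# Usage:
#   append_protein_url < assembly_summary_genbank.txt | head
```

An asm_name that contains spaces may not match the directory name used on the FTP site, so the generated links are worth spot-checking before a bulk download.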
In [38]:
PATH=$PATH:~/Data/Notebooks/tir_project
extract_bacteria_assemblies.sh
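The contents of extract_bacteria_assemblies.sh are not shown here. One way the filtering part could be done, assuming assemblies are matched to categories.dmp by taxid (column 6 of the summary against column 3 of categories.dmp), is sketched below. The function name `filter_bacteria` is hypothetical, and this is not necessarily what the script does.

```shell
# Keep only assembly records whose taxid belongs to the Bacteria category.
#   $1 = categories.dmp
#   $2 = assembly summary (tab-delimited)
filter_bacteria() {
    awk 'BEGIN { FS = OFS = "\t" }
         # First file: remember every taxid in category "B".
         NR == FNR {
             split($0, c, /[ \t]+/)
             if (c[1] == "B") bact[c[3]] = 1
             next
         }
         # Second file: skip header lines, keep records with a bacterial taxid.
         !/^#/ && ($6 in bact) { print }' "$1" "$2"
}

# Usage (appending the protein link column would be a separate step):
#   filter_bacteria categories.dmp assembly_summary_genbank.txt > bacteria_only.txt
```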
5) Summarize genomes
In [14]:
awk 'BEGIN{FS="\t"} {gen[$5"\t"$12]++} END{for (x in gen) {print x"\t"gen[x]}}' bacteria_only_assembly_summary_genbank.txt | sort
6) Get protein sequences from bacterial assemblies that are reference genomes (column 5). Column 22 holds the protein file link created in step 4.
In [ ]:
awk ' BEGIN{FS="\t"}($5 == "reference genome"){print $22}' bacteria_only_assembly_summary_genbank.txt | xargs -L 1 wget --quiet -P ~/Data/tir_project/reference_genomes
7) Get protein sequences from bacterial assemblies which are representative genomes (column 5)
In [3]:
rg=($(awk ' BEGIN{FS="\t"}($5 == "representative genome"){print $22}' bacteria_only_assembly_summary_genbank.txt))
In [ ]:
x=0
err=0
outdir=~/Data/tir_project/representative_genomes
for i in "${rg[@]}"
do
    ((x++))
    echo "$x: $i"
    # Local file name is the last path component of the URL
    var="$outdir/${i##*/}"
    if [[ $i != *".faa.gz" ]]
    then
        # Not a protein FASTA link: log it and move on
        echo "$i" >> ~/Data/tir_project/representative_genomes_ftp.err
        ((err++))
    else
        # Try the GenBank (GCA) file first, unless it was already downloaded
        if [ ! -e "$var" ]
        then
            wget -P "$outdir" "$i"
        fi
        # If it is still missing, fall back to the RefSeq (GCF) equivalent
        if [ ! -e "$var" ]
        then
            j=${i//GCA/GCF}
            wget -P "$outdir" "$j"
        fi
    fi
done
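The loop depends on two string operations: `${i##*/}` keeps only the file name of a URL, and `${i//GCA/GCF}` rewrites every GCA in the URL to GCF. The sketch below wraps both in helper functions so their effect can be checked on a single URL before starting a long download. The function names `local_name` and `refseq_url` are hypothetical, and the substitution is written with sed here, which gives the same result as the bash expansion.

```shell
# File name part of a URL (everything after the last "/").
local_name() {
    printf '%s\n' "${1##*/}"
}

# Replace every occurrence of GCA with GCF, in the directory
# components as well as in the file name.
refseq_url() {
    printf '%s\n' "$1" | sed 's/GCA/GCF/g'
}

# Usage:
#   local_name "${rg[0]}"
#   refseq_url "${rg[0]}"
```

Because the fallback file is saved under its GCF name while the existence check looks for the GCA name, rerunning the loop will attempt both downloads again for any assembly that was fetched through the fallback.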