Liang Xu

This is Liang's blog for life and work archive.

Get genomes, annotate genomes and analyze genomes

This pipeline estimates the maximum growth rates of bacteria from a set of ASVs (amplicon sequence variants, i.e. short fragments of genomes):

  1. Compare the ASVs against a database with BLAST to get rough matches between genomes and ASVs. The ASV fasta files can be fed to BLAST directly;

  2. Fetch full genome sequences from NCBI by the genome ids;

  3. Annotate genomes using Prokka;

  4. Use the annotated genomes to estimate the maximum growth rates of bacteria with gRodon2.

The pipeline has been tested on the Caltech HPC.
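As an illustration of step 1, here is a minimal sketch of a BLAST run that produces a tab-separated hit table. The function names `blast_asvs` and `hits_per_asv` are hypothetical, and it is an assumption that the tsv used in this post follows BLAST's tabular format (`-outfmt 6`), with the ASV label in column 1 and the genome hit id in column 2. The database must have been built beforehand with `makeblastdb`.

```shell
#!/bin/sh
# Sketch only; not the script from the repo.

# Hypothetical helper: BLAST ASV sequences against a nucleotide database
# and write tabular hits (-outfmt 6: query id in column 1, subject id in column 2).
blast_asvs() {
    if [ "$#" -ne 3 ]; then
        echo "usage: blast_asvs ASVS.fasta DB_NAME HITS.tsv" >&2
        return 2
    fi
    # -max_target_seqs 5 is an arbitrary choice for this example.
    blastn -query "$1" -db "$2" -outfmt 6 -max_target_seqs 5 -out "$3"
}

# Hypothetical helper: count how many hits each ASV received.
# Prints "<ASV label><TAB><number of hits>", sorted by label.
hits_per_asv() {
    awk -F '\t' '{ n[$1]++ } END { for (q in n) print q "\t" n[q] }' "$1" | sort
}
```

For example, `blast_asvs asvs.fasta mydb hits.tsv` followed by `hits_per_asv hits.tsv` would show which ASVs matched several genomes (the file and database names here are placeholders).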

Get genome fasta files from a tsv file generated by BLAST

This is a way to get genome data by genome id from a tsv file generated by BLAST. The tsv file contains the species ids and the ids of the matching genome hits. One example is provided in the repo. A shell (sh) script automates the downloads using two NCBI tools, datasets and dataformat. Parallel fetching is implemented, and a job submission file is provided for use on a cluster.

Install datasets and dataformat from NCBI

Follow the link to install these two tools.

Run getgenome_parallel_uniformdb.sh

Basically, what this script does is:

  1. Extract the species label, e.g. c497da3b39f30aceede6bec3b03cd100 and the genome id, e.g. GB_GCA_905618805.1 from each line in the tsv file. One species label may have several genome ids, also called hits;

  2. Generate a folder for each genome id. Different ASVs may match the same genome; a repeatedly matched genome is downloaded only once;

  3. Run datasets to download the genome data package. Make sure datasets is in the same folder as the sh script.

sh getgenome_parallel_uniformdb.sh

The package contains a bunch of information. See here for details. A fasta file is included in the package;

  4. Unzip the downloaded data.
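The steps above can be sketched roughly as follows. This is not the repo's script: the function names, the `genomes/` output folder and the file name `hits.tsv` are hypothetical. It assumes the genome id sits in column 2 of the tsv and that a leading `GB_` or `RS_` prefix has to be removed before the accession is passed to datasets. It also calls `datasets` from the PATH; change this to `./datasets` if the binary sits next to the script.

```shell
#!/bin/sh
# Sketch only: column layout and prefix handling are assumptions.

# Print each unique genome accession found in column 2 of the tsv,
# with a leading GB_ or RS_ removed (GB_GCA_905618805.1 -> GCA_905618805.1).
# Repeated hits are printed once, so each genome is downloaded once.
unique_accessions() {
    awk -F '\t' 'NF >= 2 {
        id = $2
        sub(/^(GB|RS)_/, "", id)
        if (!seen[id]++) print id
    }' "$1"
}

# Download one genome data package into its own folder and unzip it.
fetch_genome() {
    acc=$1
    mkdir -p "genomes/$acc"
    datasets download genome accession "$acc" --filename "genomes/$acc/$acc.zip" &&
        unzip -o -q "genomes/$acc/$acc.zip" -d "genomes/$acc"
}

# Example driver, left commented out so that loading this file has no side effects:
# unique_accessions hits.tsv | while read -r acc; do fetch_genome "$acc"; done
```

Deduplicating before the download loop keeps the folder creation and the download in step with each other: one folder and one package per genome, no matter how many ASVs matched it.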

Run getgenome_parallel.sh -number-of-cores -filename.tsv

This script does the same as getgenome.sh, but runs the downloads in parallel via GNU parallel. Note that it needs two arguments, as follows (taking the macOS terminal as an example):

sh getgenome_parallel.sh -number-of-cores -filename.tsv
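A minimal sketch of how such a parallel fetch could look with GNU parallel is shown below. The function name `fetch_in_parallel` is hypothetical, and it is an assumption that the two arguments are positional (number of cores first, tsv file second), that the genome id is in column 2, and that a `GB_` or `RS_` prefix must be stripped from it.

```shell
#!/bin/sh
# Sketch only; not the script from the repo.

# Hypothetical wrapper: $1 = number of cores, $2 = BLAST tsv file.
fetch_in_parallel() {
    if [ "$#" -ne 2 ]; then
        echo "usage: fetch_in_parallel NUMBER_OF_CORES FILE.tsv" >&2
        return 2
    fi
    ncores=$1
    tsv=$2
    # sort -u drops repeated hits, so each genome is fetched once.
    # GNU parallel replaces every {} with one accession and runs
    # at most $ncores jobs at a time.
    cut -f 2 "$tsv" | sed -e 's/^GB_//' -e 's/^RS_//' | sort -u |
        parallel -j "$ncores" \
            'mkdir -p {} && datasets download genome accession {} --filename {}/{}.zip && unzip -o -q {}/{}.zip -d {}'
}
```

Since the downloads are limited by the network rather than the CPU, asking for more jobs than there are cores may still help, but NCBI may throttle many simultaneous requests.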

Annotating genomes using Prokka

The script is prokka_genomes.sh

It reads the genome sequence .fasta files, annotates them and outputs a bunch of files. See Prokka for more details.

The output files are stored in the “prokkaoutput” folder. The annotations of each genome are stored in a subfolder named after the genome.
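The folder layout described here could be produced by a loop like the one below. This is a sketch, not the contents of prokka_genomes.sh: the function names are hypothetical, and it assumes the genome sequences are files ending in `.fasta` or `.fna` somewhere under one input folder.

```shell
#!/bin/sh
# Sketch only; not the script from the repo.

# Derive the genome name from a fasta path by dropping the folder
# and the file extension: genomes/GCA_905618805.1.fasta -> GCA_905618805.1
genome_name() {
    base=$(basename "$1")
    echo "${base%.*}"
}

# Annotate every fasta file under the folder given as $1.
# Each genome gets its own subfolder in prokkaoutput/, named after the genome.
annotate_all() {
    find "$1" -type f \( -name '*.fasta' -o -name '*.fna' \) | while read -r fa; do
        name=$(genome_name "$fa")
        prokka --outdir "prokkaoutput/$name" --prefix "$name" "$fa"
    done
}
```

Calling `annotate_all genomes` would then annotate everything found under a `genomes` folder. Prokka refuses to write into an existing output folder unless `--force` is given, so a rerun needs either that flag or a clean `prokkaoutput`.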

Caveats

  1. Tested on Linux; runs well on the Caltech cluster;

Future work

  1. Use gRodon to estimate the maximum growth rates.