BLAST on the Cloud with NCBI’s ElasticBLAST
Build an analytic pipeline with ElasticBLAST, SNS, and DataBrew on AWS
Bioinformatic programs come and go, but BLAST stays.
BLAST, short for Basic Local Alignment Search Tool, is the search engine for bioinformaticians. While Google takes text strings as queries and returns relevant web pages, BLAST accepts DNA or protein sequences as queries and returns similar sequences from databases such as the Non-redundant Nucleotide and Non-redundant Protein databases from the National Center for Biotechnology Information (NCBI).
BLAST is the bread and butter of every bioinformatician. Published in 1990 by Altschul et al., its paper has been cited 92,993 times as of this writing. For many biologists, this evergreen piece of software is the first bioinformatics tool they ever learned. Since its inception, many other sequence search tools have been developed, such as HMMER, RAPSearch, DIAMOND, and MMseqs, but none has dethroned the beloved BLAST in popularity. Just like “google”, “BLAST” has become a verb.
BLAST can run locally. Although the program itself is small, the databases are not: the formatted nucleotide database from the NCBI, for example, is larger than 130 GB. When BLAST runs, it is I/O, CPU, and memory intensive, so it is more likely to be found on supercomputer clusters than on our desktops.
Alternatively, biologists can perform BLAST searches online. The most popular option is hosted on NCBI’s website (Figure 1). Although hugely convenient, web BLAST has several limitations. Firstly, it caps the processing time, which indirectly limits the number of input sequences. Secondly, you can only query against its predefined databases. Thirdly, your searches are subject to the settings defined by the NCBI, such as parallelism, output format, and so on. And finally, you are competing with biologists around the world for NCBI’s resources, so a search can sometimes take a long time.
The third option is the cloud. Cloud computing has democratized supercomputing: we can now run BLAST on the cloud efficiently and with our own settings, the best of both worlds. There are many ways to run BLAST on the cloud. In my previous article, “Parallel BLAST against CAZy with AWS Batch”, I used Docker on AWS Batch to BLAST serverlessly against my custom database CAZy. It is also possible to provision EC2 instances and install and run BLAST on them.
Then Ravinder Pannu Eskandary from NCBI kindly informed me that NCBI is also developing a cloud-based BLAST: ElasticBLAST. She and her team let me test the latest version of the software. ElasticBLAST is currently in beta and runs on AWS and GCP, with essentially the same commands across the two platforms. You can test the software yourself by following their instructions.
This is a whole new ball game. As my BLAST article demonstrated, setting up a homegrown cloud BLAST involves lots of clicks or a giant piece of Terraform. With NCBI stepping in and taking over the BLAST setup, we users can focus on what we do best: biology. It also means that the whole workflow, from sequence upload through ElasticBLAST to the result analysis, is completely cloud-native (Figure 3). No data needs to leave the cloud if you so wish. The cloud is also scalable, and it in fact encourages concurrency, since the costs are the same whether the work runs in parallel or sequentially. Small research groups and biology hobbyists can now do resource-hungry bioinformatic analyses without their own supercomputers.
Currently, the NCBI team is focusing on the usability, scalability, and performance of the software. As an early tester, I would like to build a pipeline on top of ElasticBLAST. In this article, I am going to set up an SNS email notification, walk you through the ElasticBLAST example run, and then use DataBrew to analyze the results visually. This tutorial runs on AWS.
1. Set up an S3 bucket with SNS email notification
Before we run ElasticBLAST, we need to create an S3 bucket to hold the incoming results. It is quite hard to monitor the job status later with the status subcommand, because it relies on a live CloudShell session, and the session can time out before the jobs finish. As a workaround, we should set up email notifications on the bucket. This way, we will get emails from AWS when the jobs are done.
We need to create the SNS topic before the S3 bucket. Log into your AWS console and head over to SNS. Create a Standard topic named blast-output-topic.

We also need to modify the access policy to permit the future S3 bucket to invoke SNS. Fill in your future bucket name and your account ID (the 12-digit ID next to My Account in the drop-down menu of your username) in the template below and paste the content into the JSON editor field.
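The following is a minimal policy sketch, assuming the topic is called blast-output-topic and lives in us-east-1; it allows the S3 service to publish to the topic, but only on behalf of your bucket and your account. Replace [YourBucketName] and [YourAccountID] with your own values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ToPublishToBlastTopic",
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "SNS:Publish",
      "Resource": "arn:aws:sns:us-east-1:[YourAccountID]:blast-output-topic",
      "Condition": {
        "ArnLike": { "aws:SourceArn": "arn:aws:s3:::[YourBucketName]" },
        "StringEquals": { "aws:SourceAccount": "[YourAccountID]" }
      }
    }
  ]
}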
Afterward, create a subscription with the Email protocol. Fill in your email address as the endpoint and hit the Create subscription button. AWS will then send you an email, and you need to confirm the subscription by clicking the link inside.
Now head over to S3 and create a bucket with the name that you filled into the policy previously. Mine is eblast-result-sixing. Choose the us-east-1 region. Afterward, click into the Properties tab of the new bucket, find the Event notifications section, and create an event notification called result received. Write results/ in Prefix and .gz in Suffix, because ElasticBLAST will write the results as .gz files into the results folder. Enable All object create events under Event types. Select SNS topic as the Destination and choose blast-output-topic in the drop-down menu.
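If you prefer the command line over the console, the same notification can be attached with the AWS CLI. This is a sketch under the names used above; replace the account ID placeholder and the bucket name with your own.

cat > notification.json <<'EOF'
{
  "TopicConfigurations": [
    {
      "TopicArn": "arn:aws:sns:us-east-1:[YourAccountID]:blast-output-topic",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "results/" },
            { "Name": "suffix", "Value": ".gz" }
          ]
        }
      }
    }
  ]
}
EOF

# Attach the notification configuration to the results bucket
aws s3api put-bucket-notification-configuration \
    --bucket eblast-result-sixing \
    --notification-configuration file://notification.json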
You can now test the setup by creating a results folder and uploading a .gz file into it. If you receive an email with the topic name, congratulations!
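For example, a quick test from CloudShell could look like this (eblast-result-sixing is my bucket; substitute yours):

# Create a tiny dummy archive and drop it under the results/ prefix
echo "hello" | gzip > test.gz
aws s3 cp test.gz s3://eblast-result-sixing/results/test.gz
# An email from blast-output-topic should arrive shortly; delete the test file afterward
aws s3 rm s3://eblast-result-sixing/results/test.gz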
2. Run ElasticBLAST in CloudShell
Now it is time to run ElasticBLAST itself. Open up a CloudShell session. The version is 0.2.5 as of this writing, and one command, pip install wheel, is missing from NCBI’s official Quickstart. So a few extra commands are needed to download and verify the elastic-blast binary.
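The following is a minimal sketch, assuming an installation from PyPI inside a virtual environment; where NCBI’s Quickstart differs (for example, pointing at a specific download location), follow the Quickstart.

python3 -m venv .env                 # create an isolated environment in CloudShell
source .env/bin/activate
pip install wheel                    # the step missing from the Quickstart
pip install elastic-blast            # assumption: the package is installed from PyPI
elastic-blast --version              # should print 0.2.5 (or newer)
elastic-blast --help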
If you can see both the version number and the help messages, your installation was a success. Next, we need to write a configuration file. Adjust the content in the template below and paste it into a file called BDQA.ini with either vi or cat (read here or here if you need help); you should then have one file, BDQA.ini, in your current directory.
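Here is a sketch of the template, written with cat. The section and key names follow NCBI’s quickstart; the query file, the node count, and the region come from this example, while program = blastp and the extra output columns (sskingdoms, ssciname) are my assumptions based on the protein database and the schema we will apply in DataBrew later. Double-check the keys against the current documentation.

cat > BDQA.ini <<'EOF'
[cloud-provider]
aws-region = us-east-1

[cluster]
num-nodes = 10

[blast]
# blastp is assumed because the queries are protein sequences
program = blastp
db = refseq_protein
queries = s3://elasticblast-test/queries/BDQA01.1.fsa_aa
# replace the bucket with your own; BDQA is the [RunName] used in this article
results = s3://eblast-result-sixing/results/BDQA
options = -outfmt "6 std sskingdoms ssciname"
EOF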
This configuration file instructs ElasticBLAST to query the sequences in the BDQA01.1.fsa_aa file from the s3://elasticblast-test/queries/ bucket that NCBI has prepared for us. The dataset is a viral metagenome associated with marine microorganisms. The job will run on 10 nodes, and the results will be written into the results/[RunName] folder of your S3 bucket. I have set -outfmt to 6 because this format is easier to process later than the 7 used in NCBI’s documentation. The reference database is refseq_protein; NCBI’s documentation lists the other databases that are available out of the box.
You can also make and upload your own database (a rough sketch follows), and you can enable the auto-shutdown feature so that ElasticBLAST cleans up its own resources once the search is done.
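As an illustration of the custom-database route only (it is not needed for this example run): the file names below are hypothetical, and pointing the db key at an S3 path is my assumption, so consult the ElasticBLAST documentation for the exact requirements.

# Format a hypothetical protein FASTA into a BLAST database
makeblastdb -in my_proteins.fasta -dbtype prot -out myDB -parse_seqids
# Copy the formatted database files to your own bucket
aws s3 cp . s3://eblast-result-sixing/db/ --recursive --exclude "*" --include "myDB.*"
# Then, in the config file: db = s3://eblast-result-sixing/db/myDB   (assumption)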
Now execute this command to kick off the ElasticBLAST run:
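elastic-blast submit --cfg BDQA.ini    # submit the search described in BDQA.ini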
Wait until the command finishes. You can monitor the status with:
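elastic-blast status --cfg BDQA.ini    # counts the query batches that are pending, running, succeeded, or failed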
In the output, you can see that ElasticBLAST splits the input into 11 parts. Once all 11 are in the Succeeded row, the job is done.
But as I stated previously, it is quite hard to hang onto a live CloudShell session and watch the paint dry. Because we have set up an SNS notification, we can close CloudShell and check our emails instead. Once you receive 11 emails, you know the whole job is completed.
Before we move on, we should clean up the cloud resources to avoid a surprise bill from AWS.
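Cleanup is a single subcommand; it tears down the compute resources that the run created, while the results stay in your bucket.

elastic-blast delete --cfg BDQA.ini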
3. Analyze the results in DataBrew
If the example run is successful, you should see 11 batch_*.out.gz files in the results/[RunName] folder of your S3 bucket. We can analyze these outputs visually with DataBrew right away.
First, head over to AWS Glue DataBrew. Create a project BDQA. Then select New dataset and name it BDQABlastOutput. Adjust the following string with your ElasticBLAST output location and put it under Enter your source from S3:
s3://[YourBucketName]/results/[RunName]/<[^/]+>.out.gz
The <[^/]+>.out.gz part instructs DataBrew to pick up every file with the .out.gz extension.
Select CSV under Selected file type and Tab (\t) as the delimiter. Also, select Add default header, because BLAST’s outfmt 6 does not produce a header. Adjust the Sampling size to 5,000, although I checked that there are only 3,796 lines of results (your results may vary). Finally, create a new IAM role with a suffix such as databrew-for-elasticblast and click Create project.
Afterward, DataBrew will load the data into a spreadsheet with column statistics. Click SCHEMA and correct the column names with these values in the exact order:
qaccver, saccver, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, sskingdoms, ssciname
Now when you click GRID, you should be greeted with the result in its full glory. It shows that there are a total of 3,796 BLAST hits.
Upon closer inspection, it is clear that each query (qaccver) can have multiple hits. We can remove the duplicates by clicking the … next to qaccver, selecting Remove duplicate values, and clicking Apply in the right panel. This way, we only inspect the top hit of each query.
After the deduplication, it turns out that of the 548 input sequences, only 116 proteins have hits.
Now we can see that the percentage identities (pident) of these 116 proteins are not very high. The median is just 33.83%, an indication that these viral sequences are quite novel. Nine of them have top hits from Bacteria, so the sample may contain bacterial contamination. Micromonas pusilla reovirus, Callinectes sapidus reovirus, and Liao Ning virus are among the top five hit organisms, and they are all linked to marine habitats. Micromonas pusilla reoviruses were isolated from the marine protist Micromonas pusilla, while Callinectes sapidus reoviruses are pathogenic to blue crabs. According to Zhang et al., the Liao Ning viruses first appeared in Australia in the South Pacific region and then spread to mainland China. In contrast, Kadipiro viruses were isolated from mosquitoes.
Conclusion
ElasticBLAST is a big addition to the cloud-native bioinformatics workflow. With just a web browser, we can BLAST sequences, clean the results, and visualize the findings, all in the cloud, without downloading databases, provisioning a supercomputer, or competing with colleagues for compute resources. It is scalable and cost-effective.
At the start of a run, ElasticBLAST needs to spin up the infrastructure and transfer the databases; only once it has warmed up does the BLAST proper begin. So for small query datasets, the initial run can take longer than web BLAST. Subsequent runs no longer need that preparation and thus finish faster, and the larger the query dataset, the greater the speed gain.
Unlike with web BLAST, the user gets the bill this time, but the costs are quite low. The mobilization of the NCBI database does not incur costs; the user pays only for the rest of the cloud service charges. On AWS, ElasticBLAST invokes m5.8xlarge instances, each with 32 vCPUs and an on-demand hourly rate of $1.536. Also, the first 1,000 emails in SNS are free. Each DataBrew interactive session is 30 minutes long and costs $1, although the first 40 interactive sessions are free for first-time users.
The workflow in this article is just a start. You can set up another SNS to trigger an AWS Glue crawler. The crawler creates a data catalog out of the BLAST results and you can then use Athena to query it. You can also trigger a Lambda function to create a Krona Plot to visualize the results.
Use ElasticBLAST to supercharge your bioinformatics research today, and please tell me about your experience.
I thank Ravinder Pannu Eskandary, Tom Madden, and Christiam Camacho from NCBI for giving me the opportunity to test ElasticBLAST.