The Microbial Genomes Atlas Science Gateway – MiGA @ XSEDE: A Searchable Database of Prokaryotic Genomes for Taxonomic Identification and Diversity Cataloguing

Konstantinos Konstantinidis, Georgia Institute of Technology

XSEDE Allocation Request MCB190042

CoPI: Luis Rodriguez Rojas, University of Innsbruck
Abstract: The diversity of prokaryotic microbes on the planet is very large, estimated at over one billion species of bacteria, and most of it remains undiscovered. Genome sequencing can help characterize this diversity and has recently become routine, but most microbial scientists have been overwhelmed by the amount of genomic data that have recently become available. Tools that can direct researchers to the most "interesting" genomes among thousands of candidates will be important, both for diversity discovery and for the identification (diagnostics) of microbial disease agents. Such tools are currently not available, or they do not offer online searching of unknown (query) genomes against all available genomes; that is, they do not scale with the sequence data that are becoming available. In response, we recently introduced the Microbial Genomes Atlas (MiGA) (Rodriguez-R et al., NAR, 2018), a genomic data processing and management system that uses whole-genome comparisons for the identification of relatives and taxonomic classification, and provides several tools for genome quality evaluation and genome clustering for novel microorganisms. Together with the MiGA infrastructure, we also released the MiGA Online webserver, an online system that allows users to evaluate, compare, and classify their own genome sequences against different reference databases, including the collection of all complete prokaryotic genomes in NCBI (NCBI_Prok; ~15,000 genomes), all reference genomes derived from type material (TypeMat; ~15,000 genomes), and two large collections of metagenome-assembled genomes (MAGs and Parks8, with ~3,000 and ~8,000 genomes, respectively), among several others. MiGA is currently being used by hundreds of users and has already processed about 45,000 query genomes, which is remarkable for a resource first reported less than 3 years ago and a testament that MiGA fulfils a critical need of contemporary research and education.
Indeed, MiGA has already been used for the proposal of novel taxa, the identification and classification of microbial genomes, and the discussion of data-driven microbial taxonomy. Notably, MiGA is currently unique among related efforts in that it allows external users to query their own sequences against MiGA’s internal databases, and it provides not only taxonomic classification but also an assessment of genome quality, completeness, and gene content variation. MiGA includes a series of heuristics to allow the rapid identification of closest relatives using whole-genome comparisons. However, indexing and processing query datasets remains computationally challenging, given the size and growth rate of the databases. We currently invest around 240 thousand CPU hours each month on this task, the equivalent of over 300 dedicated CPUs at 100% capacity. In our previous allocation project, we successfully implemented MiGA and indexed several reference genome databases on Comet and, more recently, on Expanse. We also recruited a few outside users who successfully tested this implementation and offered suggestions for improving its web interface. Accordingly, in the present renewal application, we propose to advertise MiGA@XSEDE more broadly and recruit hundreds, if not thousands, of users from around the world to perform their genome analyses on the Expanse supercomputer. The availability of the MiGA infrastructure on XSEDE will allow any researcher to perform high-throughput analyses that are currently not available elsewhere, for both research and education. We have previously delivered bioinformatics workshops using the MiGA Command Line Interface, focused on processing genomic and metagenomic data at a small scale (processing ~10 genomes) using commercial cloud computing (Amazon Web Services). With MiGA available in XSEDE, these educational materials would also be available, at a large scale, to any person with access to the XSEDE system.
Our projections indicate that about 5 million SUs per year will be required to support the scientific community that is interested in using the MiGA infrastructure. This is a modest prediction given that this community is broad and covers the fields of ecology, systematics, evolution, engineering, agriculture, and medicine.

Allocations:

2021 SDSC Expanse Projects Storage 6,000.0 GB
2021 SDSC Dell Cluster with AMD Rome HDR IB (Expanse) 5,332,000.0 Core-hours
The estimated value of these awarded resources is $23,760.80. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for the U.S. The dollar value of the allocation is estimated from the NSF awards supporting the allocated resources.
2019 XSEDE Extended Collaborative Support Yes
2019 SDSC Dell Cluster with Intel Haswell Processors (Comet) 700,000.0 SUs
2019 SDSC Medium-term disk storage (Data Oasis) 3,072.0 GB
The estimated value of these awarded resources is $11,618.21. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for the U.S. The dollar value of the allocation is estimated from the NSF awards supporting the allocated resources.
There are no other allocations for this project.

Other Titles:

There are no prior titles for this project.