Deep-Learning Whole-Genome Prediction for Complex Plant Genomes

Charles Chen, Oklahoma State University

ORCID: 0000-0002-2203-0433

ACCESS Allocation Request MCB180177

Abstract: Global climate change is altering habitat conditions at an unprecedented pace. Yet it remains unclear whether and how agricultural production, including cereal crops and forest systems, can keep pace with these changes and sustain the critical need to feed a growing human population and support the planet's wellbeing. With the advent of next-generation sequencing technology, crop variety development and tree improvement programs have a keen interest in early estimation of agronomic performance, such as end-use quality, productivity, and growth and adaptive attributes, in the hope that a genetics-driven paradigm shift can increase adaptability and climate resilience in crop plants. However, prediction and association analyses based on genetic markers such as single nucleotide polymorphisms (SNPs) have fallen short, because the SNP variants identified in association with trait variability explain far less heritability than empirical estimates lead one to expect, resulting in unreliable predictions and disappointing technology adoption. In the past decade, a growing number of studies have demonstrated that structural variants (SVs), genomic variations such as copy number variations, deletions, insertions, tandem duplications, and inversions that span larger stretches of nucleotides, have substantial impacts on the total fitness and adaptive capacity of plants. However, SVs vary in length and, because of their size, often overlap; this unstructured representation makes SVs difficult to fit into existing statistical algorithms, and even harder to interpret alongside single-nucleotide mutations and substitutions such as SNPs. Taking advantage of deep learning at the critical step of feature extraction and embedding, we have proposed a novel deep learning framework for whole-genome predictive analysis.
Our approach seeks predictive power by incorporating the rawest form of genomic information, the DNA sequences themselves, so that all genomic variants, both structured (SNPs) and unstructured (SVs), are modeled simultaneously for prediction. This XSEDE application requests computing resources adequate for identifying SVs and for building and verifying the capacity of our deep learning prediction model in agriculturally and ecologically important wheat and conifer species.
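
As a purely illustrative sketch (not the project's actual pipeline), the idea of feeding raw DNA sequence to a learning model can be pictured with a simple one-hot encoding, in which a SNP changes one row of the encoded matrix while a deletion changes the number of rows. The function name `one_hot_encode`, the base ordering, and the toy sequences below are assumptions made for this example only.

```python
# Hypothetical illustration: the names and the encoding choice are assumptions,
# not the framework described in the abstract.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Map a DNA string to a list of 4-element rows, one row per nucleotide.

    Unknown or ambiguous symbols (for example 'N') become an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for symbol in sequence.upper():
        row = [0, 0, 0, 0]
        if symbol in index:
            row[index[symbol]] = 1
        encoded.append(row)
    return encoded


# Toy sequences (made up for illustration only).
reference = "ACGTAC"
with_snp = "ACGTCC"       # single substitution: same length, one row differs
with_deletion = "ACAC"    # "GT" removed: a toy structural variant, fewer rows

for name, seq in [("reference", reference),
                  ("snp", with_snp),
                  ("deletion", with_deletion)]:
    print(name, len(one_hot_encode(seq)))
```

The toy deletion yields a shorter encoding than the reference, which hints at why variable-length, overlapping SVs do not fit the fixed marker-by-sample tables that SNP-based methods rely on.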

Allocations:

2022 PSC Bridges-2 Storage (PSC Ocean) 86,000.0 GB
2022 PSC Bridges-2 Regular Memory (PSC Bridges-2 RM) 569,467.0 Core-hours
2022 PSC Bridges-2 GPU (PSC Bridges-2 GPU) 20,000.0 GPU Hours
2022 PSC Bridges-2 Extreme Memory (PSC Bridges-2 EM) 267,024.0 Core-hours
The estimated value of these awarded resources is $67,096.16. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for the U.S. The dollar value of the allocation is estimated from the NSF awards supporting the allocated resources.
2020 PSC Bridges-2 Extreme Memory (PSC Bridges-2 EM) 414,720.0 Core-hours
2020 PSC Bridges-2 Storage (PSC Ocean) 60,000.0 GB
2020 PSC Bridges-2 Regular Memory (PSC Bridges-2 RM) 271,875.0 Core-hours
2020 PSC GPU-AI (Bridges GPU Artificial Intelligence) 75,000.0 GPU Hours
The estimated value of these awarded resources is $174,414.30.
2018 PSC GPU-AI (Bridges GPU Artificial Intelligence) 15,000.0 GPU Hours
2018 PSC Storage (Bridges Pylon) 72,000.0 GB
2018 PSC Large Memory (Bridges Large) 58,900.0 Memory Hours
2018 PSC Regular Memory (Bridges) 334,617.0 SUs
The estimated value of these awarded resources is $69,415.58.

Other Titles:

Deep Convolutional Neural Network Whole-Genome Prediction by Structural Variants in Complex Plant Genomes
Predictive Modeling for Climate Resilient Phenotypes in Mega-Size Plant Genomes