An Efficient Lossy Compression Framework for Reducing Memory Footprint for Extreme-Scale Deep Learning on GPU-Based HPC Systems

Dingwen Tao, University of Alabama

XSEDE Allocation Request ASC200032

Abstract: This project will develop an efficient lossy compression code that significantly reduces the memory footprint of extreme-scale deep neural networks (DNNs) on GPU-based high-performance computing (HPC) systems. DNNs have rapidly evolved into a state-of-the-art technique in many science and technology domains. They keep growing in scale because increasingly complex applications demand higher analysis quality, leading to extreme-scale DNNs. However, the ever-increasing scale of DNNs requires large amounts of resources, such as memory, which poses further challenges for heterogeneous HPC systems. One important reason is the huge gap between the memory required by extreme-scale DNNs and the memory available on graphics processing units (GPUs). This gap compels researchers to use multiple GPUs, which can significantly degrade performance because of expensive inter-GPU communication. We have found that error-bounded lossy compression offers high data reduction capability and precise error control for large-scale scientific applications. In this project, we will optimize the compression quality of our error-bounded lossy compressor SZ for different DNN intermediate data based on their features, such as sparsity and correlation. We will also optimize the performance of the lossy compression code on state-of-the-art GPUs. Finally, we will integrate our lossy compressor into a distributed deep learning framework and evaluate its performance scalability with interdisciplinary applications on large-scale GPU-based HPC systems. We are requesting a Startup account to develop our lossy compression code and test the deep learning framework it supports on a state-of-the-art GPU-based HPC system, namely Bridges GPU-AI at PSC.
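
To illustrate what "error-bounded" means in the abstract, the following is a minimal sketch of uniform quantization with an absolute error bound. It is not the SZ algorithm, which is more involved. The function names, the sample activation values, and the bound of 0.01 are all assumptions chosen for illustration. Each value is mapped to an integer code using a step of twice the error bound, so the reconstructed value should differ from the original by no more than the bound, up to floating-point rounding.

```python
# Hypothetical illustration of an absolute error bound; not the SZ compressor.

def quantize(values, error_bound):
    """Map each float to an integer code using a step of 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def dequantize(codes, error_bound):
    """Reconstruct approximate floats from the integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

# Made-up sparse "activation" values: many exact zeros, a few nonzeros.
activations = [0.0, 0.0, 0.731, -0.052, 0.0, 1.204, 0.0, 0.318]
eb = 0.01  # assumed absolute error bound

codes = quantize(activations, eb)
restored = dequantize(codes, eb)

# Largest pointwise reconstruction error; expected to stay within eb.
max_err = max(abs(a - r) for a, r in zip(activations, restored))
print("codes:", codes)
print("max error within bound:", max_err <= eb * (1 + 1e-9))
```

In this sketch the zeros all map to the code 0 and are reconstructed exactly, which hints at why sparse intermediate data can be reduced well: long runs of identical small integers are easy for a later lossless stage to encode.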

Allocations:

2020 PSC GPU-AI (Bridges GPU Artificial Intelligence) 1,500.0 GPU Hours
2020 PSC Storage (Bridges Pylon) 500.0 GB
The estimated value of these awarded resources is $1,849.50. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for the U.S. The dollar value of the allocation is estimated from the NSF awards supporting the allocated resources.
There are no other allocations for this project.

Other Titles:

There are no prior titles for this project.