"Data Locality Enhancement of Dynamic Simulations for Exascale Computing," Xipeng Shen, College of William and Mary
From Modelado Foundation
Motivation
Memory performance is key to computing efficiency in the era of chip multiprocessors (CMPs) because of the growing gap between slowly increasing memory bandwidth and rapidly increasing processor demand for data. The trend towards exascale computing, in which each processor is expected to contain hundreds or thousands of (heterogeneous) cores, makes the problem more pressing.
Computer simulation is important for scientific research in many disciplines. Many simulation programs are complex and transfer large amounts of data in dynamically changing patterns. Unfortunately, today's computer systems lack support for such heavy memory traffic. Although hardware caches can reduce the number of memory accesses, scientific programs, especially those with dynamic reference patterns, are among the most difficult to cache. The issue is more critical in CMP systems, where some on-chip storage and the off-chip pin bandwidth are shared among the cores of a single chip. The sharing makes co-running threads compete for limited cache and bandwidth, which further degrades memory performance. Many traditional optimization techniques are oblivious to these features and lose their effectiveness. Existing locality optimizations therefore need to evolve to match the memory hierarchies of multicore and manycore systems, especially for dynamic scientific applications.
Summary of Proposed Techniques
In this research, the PI proposes to improve memory performance of dynamic applications by developing two new techniques that are tailored especially for the emerging features of CMP.
The first is asynchronous streamlining. It analyzes the memory reference patterns of an application at runtime and regulates both control flows and memory references on the fly. Its underlying observation is that irregular memory references and irregular control flows both stem from an inferior mapping between threads and data (data locations for the former; data values for the latter). It enhances the mapping through three components. The first includes a set of transformations for realizing new thread-data mappings. Its core consists of two primary mechanisms, data relocation and reference redirection, which work hand in hand to provide a spectrum of solutions suitable for various scenarios. Two factors determine whether the transformations work effectively: the computation of desirable data layouts or mappings, and the minimization and concealment of transformation overhead. The former is supported by optimality analysis and approximation algorithm design. The latter is achieved by asynchronous transformation over a pipelining scheme and a two-level efficiency-driven adaptation. We have implemented a prototype of the technique, achieving 20% to 58% speedup for many scientific simulations, including molecular dynamics simulations, partial differential equation solving, and 3-D image reconstruction. (details here)
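As an illustration of how data relocation and reference redirection can work together, the following is a minimal sketch, not the prototype described above. It assumes a simple first-touch ordering as the target layout and runs synchronously; the function names (`first_touch_order`, `relocate`, `redirect`) and the sample arrays are hypothetical. Relocation moves the data so that elements are stored in the order they are first referenced, and redirection rewrites the index array so that every reference still reaches the same value.

```python
def first_touch_order(refs, n):
    """Map each old index to a new slot, in order of first reference.

    Assumption: first-touch ordering stands in for whatever layout
    computation a real system would use.
    """
    new_pos = [-1] * n
    next_slot = 0
    for r in refs:
        if new_pos[r] == -1:
            new_pos[r] = next_slot
            next_slot += 1
    # Elements that are never referenced are placed at the end.
    for old in range(n):
        if new_pos[old] == -1:
            new_pos[old] = next_slot
            next_slot += 1
    return new_pos


def relocate(data, new_pos):
    """Data relocation: copy each element to its new slot."""
    out = [None] * len(data)
    for old, value in enumerate(data):
        out[new_pos[old]] = value
    return out


def redirect(refs, new_pos):
    """Reference redirection: rewrite indices to follow the moved data."""
    return [new_pos[r] for r in refs]


# Hypothetical example: an irregular reference stream over six elements.
data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
refs = [4, 1, 4, 5, 1, 0]

new_pos = first_touch_order(refs, len(data))
new_data = relocate(data, new_pos)
new_refs = redirect(refs, new_pos)

# The redirected references read the same values as before,
# but now walk the relocated array mostly front to back.
print(new_refs)   # [0, 1, 0, 2, 1, 3]
print(new_data)   # [14.0, 11.0, 15.0, 10.0, 12.0, 13.0]
```

In this sketch the two steps must be applied together to preserve program semantics; the proposal additionally hides their cost by running the transformation asynchronously, which is not modeled here.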
The second technique, neighborhood-aware locality optimization, concentrates on another key feature of CMPs: the non-uniform relations among computing elements. It consists of two components. The first aims to reveal the implications of this non-uniformity for the interactions among co-running threads and for the use of on-chip storage and memory bandwidth. Building on traditional locality analysis, we propose concurrent reuse distance and cross-thread reference affinity analysis to capture inter-thread interference and cross-thread data reference patterns. The second component translates the analysis results into performance improvement by adjusting the distribution of tasks, the organization of data in memory, and the scheduling of threads accordingly. We have developed a prototype of shared-cache-aware transformations, which yields a 36% speedup on a set of multithreaded applications. (details here)
This research will produce a robust tool that lets scientific users enhance program locality on multi- and many-core systems to a degree not achievable with existing tools. It is the next step from the PI's previous work on memory optimizations for scientific programs. It will contribute to the advancement of computational science and promote academic research and education in the challenging field of scientific computing.
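To make the notion of concurrent reuse distance more concrete, here is a small sketch under stated assumptions; it is not the analysis tool from this project. It assumes the common definition of reuse distance (the number of distinct addresses touched between two consecutive accesses to the same address) and treats the concurrent variant as that same measure taken over the interleaved trace of all threads sharing a cache. The function names and the traces are hypothetical, and the quadratic-time computation is chosen for clarity rather than speed.

```python
def reuse_distances(trace):
    """Reuse distance of each access; None marks a first access.

    Naive O(n^2) version: counts distinct addresses seen since the
    previous access to the same address.
    """
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def concurrent_reuse_distances(interleaved, thread_id):
    """Distances seen by one thread when measured on the shared trace.

    `interleaved` is a list of (thread_id, address) pairs in the order
    the accesses reach the shared cache (an assumed input format).
    """
    addrs = [addr for _, addr in interleaved]
    all_distances = reuse_distances(addrs)
    return [d for (tid, _), d in zip(interleaved, all_distances)
            if tid == thread_id]


# Hypothetical traces: thread 0 reuses "a"; thread 1 touches other data.
alone = reuse_distances(["a", "b", "a"])
shared = concurrent_reuse_distances(
    [(0, "a"), (1, "x"), (0, "b"), (1, "y"), (0, "a")], thread_id=0)

print(alone)   # [None, None, 1]
print(shared)  # [None, None, 3]
```

The reuse of `a` grows from distance 1 to distance 3 once the co-running thread's accesses are interleaved, which is the kind of inter-thread interference that a single-thread analysis would miss.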
Quad chart: (Larger: File:QUAD.pdf)
Publications: here.