Data Movement Dominates

Energy is the fundamental barrier to exascale computing, and it is dominated by the cost of moving data, not by computation. Further, data movement, not computation, dominates the performance of real applications in HPC environments. This project will address the problems of data movement by examining three critical technologies: 3D integration, optical chip-to-chip communication, and hardware support for logic operations in the memory system. To tie these together, new programming models will also be explored.

Approach

The proposed systems are being simulated with a combination of the PhoenixSim optical interconnect simulator, the DRAMsim advanced memory simulator, and the Structural Simulation Toolkit (SST) (parallel processor and I/O models), as well as a parallel simulation and power analysis infrastructure.

We plan to study the architectural innovations enabled by these technologies, such as message-driven computation and very low power system designs. The DMD project aims to produce results that will allow DOE to select technology and architecture investments for an exascale system before the end of the decade. This approach is vendor agnostic, facilitating the broad adoption of these technologies and architectures.

Focus: Memory Architecture

Close connection of logic and memory has long been a goal of computer architecture. Low latency, high bandwidth connections between processing elements and main memory storage would eliminate the von Neumann bottleneck. Unfortunately, attempts to integrate memory and logic onto the same die have met with limited success due to differences in the fabrication processes for DRAM and high performance logic. Hybrid approaches such as eDRAM typically sacrifice memory density and have a high cost. The separation between logic and memory is the cause of the “memory wall” and is the dominant factor in node-level performance.

The use of 3D stacking will be the most fundamental change to main memory systems since the invention of DRAM. Its most important feature is the ability to integrate dense DRAM memory with high performance CMOS logic parts in the same package. By connecting logic and memory together in close proximity it is possible to move processing, data handling, and other tasks closer to the memory, reducing latency and power. Quantifying these savings will be a major focus of this project.

Early results indicate performance improvements of up to 270% for stacked memory and up to 300% for stacked memory with logic in memory.

Focus: Photonic Communication

A conceptual Photonics-enabled 3D stack.

Chip-scale nano-photonics is poised to uniquely address challenges in high performance computing due to its high IO bandwidth density and distance-independent signaling. However, important system-level questions must be answered before making use of this game-changing technology, such as how physical characteristics and constraints affect architectural design and performance. Our combined experience with photonic device testing and modeling, along with our continued system-level simulation efforts, will lead us toward realistic and feasible designs that deliver significant power and performance benefits with photonics for exascale computing.

Early results show great promise for reducing power, improving performance, and improving reliability through multicast memory accesses.

Focus: Multi-Level Memory

Multi-level main memories (stacked or DDR DRAM augmented with non-volatile memory) have been the subject of a considerable amount of research in the last few years. However, there are still unexplored issues in interfaces and eviction schemes, particularly in the context of low power.
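To make the eviction question concrete, the following is a minimal sketch of a two-level memory in which a small DRAM level fronts a non-volatile backing store. It assumes a plain LRU policy and page-granularity tracking, neither of which is stated by the project; the class name `TwoLevelMemory` and its counters are hypothetical and serve only to illustrate how an eviction scheme determines the write traffic that reaches the non-volatile level.

```python
from collections import OrderedDict


class TwoLevelMemory:
    """Toy model: DRAM level with LRU eviction in front of an NVM backing store.

    LRU is an assumption made for illustration, not the project's scheme.
    """

    def __init__(self, dram_pages):
        self.dram_pages = dram_pages
        self.dram = OrderedDict()  # page -> dirty flag, least recently used first
        self.hits = 0
        self.misses = 0
        self.writebacks = 0  # dirty pages written to NVM on eviction

    def access(self, page, write=False):
        if page in self.dram:
            self.hits += 1
            # A page stays dirty once it has been written.
            self.dram[page] = self.dram[page] or write
            self.dram.move_to_end(page)
            return
        self.misses += 1
        if len(self.dram) >= self.dram_pages:
            _, dirty = self.dram.popitem(last=False)  # evict the LRU page
            if dirty:
                self.writebacks += 1
        self.dram[page] = write


# Made-up access trace: (page, is_write)
trace = [(0, True), (1, False), (0, False), (2, False), (3, False)]
mem = TwoLevelMemory(dram_pages=2)
for page, is_write in trace:
    mem.access(page, write=is_write)
print(mem.hits, mem.misses, mem.writebacks)
```

Swapping the eviction rule (for example, preferring to evict clean pages) changes `writebacks` for the same trace, which is the kind of trade-off that matters when writes to the non-volatile level are costly in power.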

In almost all current studies, the host interface of the non-volatile backing store for a hybrid memory is either not specified or assumed to be some existing interface such as DDR or PCIe. Little work has been done to determine the actual bandwidth requirements for such a backing store under different loads. Existing interfaces are not well suited to the unique needs of non-volatile hybrid main memories. DDR-type interfaces provide a great deal of half-duplex bandwidth at a high pin cost. PCIe-type interfaces, on the other hand, provide full-duplex communication with somewhat less bandwidth but fewer pins. We will develop a new interface that leverages the best of both approaches.
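The trade-off between the two interface styles can be sketched with a simple first-order model. The bandwidth and pin counts below are invented for illustration and do not describe any real DDR or PCIe generation; the model also ignores bus turnaround, protocol overhead, and latency.

```python
def half_duplex_time(read_gb, write_gb, bandwidth_gbps):
    """Reads and writes share one bus, so their transfer times add."""
    return (read_gb + write_gb) / bandwidth_gbps


def full_duplex_time(read_gb, write_gb, bandwidth_per_direction_gbps):
    """Reads and writes proceed concurrently; the larger stream sets the time."""
    return max(read_gb, write_gb) / bandwidth_per_direction_gbps


def bandwidth_per_pin(total_gb, seconds, pins):
    """Delivered GB/s per pin for the whole transfer."""
    return total_gb / seconds / pins


# Hypothetical workload and interfaces (all numbers are assumptions).
read_gb, write_gb = 8, 8
wide_bus = {"bandwidth": 16, "pins": 200}      # DDR-like: wide, half duplex
narrow_link = {"bandwidth": 4, "pins": 20}     # PCIe-like: narrow, full duplex

t_wide = half_duplex_time(read_gb, write_gb, wide_bus["bandwidth"])
t_narrow = full_duplex_time(read_gb, write_gb, narrow_link["bandwidth"])

per_pin_wide = bandwidth_per_pin(read_gb + write_gb, t_wide, wide_bus["pins"])
per_pin_narrow = bandwidth_per_pin(read_gb + write_gb, t_narrow, narrow_link["pins"])

print(f"wide bus:    {t_wide:.1f} s, {per_pin_wide:.2f} GB/s per pin")
print(f"narrow link: {t_narrow:.1f} s, {per_pin_narrow:.2f} GB/s per pin")
```

With these assumed numbers the wide bus finishes sooner, while the narrow link delivers more bandwidth per pin, which is the tension a new interface would need to resolve.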

Focus: Processing-In-Memory

A conceptual Processing-in-memory stack.

A number of operations can be off-loaded to logic elements stacked with the DRAM or scattered through the memory hierarchy. Likely candidates for processing-in-memory are operations which have low computational requirements but high memory access requirements. This allows the logic components of the memory stack to be kept simple and low power while still maximizing their potential to accelerate execution.
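One rough way to screen for such candidates is to compute each operation's arithmetic intensity, that is, operations performed per byte of memory traffic, and flag those that fall below a threshold. The sketch below assumes this metric and a threshold of 1.0 operation per byte; the `Kernel` type, the threshold, and the example workloads are all hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    name: str
    ops: int          # arithmetic/logic operations performed
    bytes_moved: int  # bytes read from or written to main memory


def intensity(k: Kernel) -> float:
    """Operations per byte of memory traffic."""
    return k.ops / k.bytes_moved


def pim_candidates(kernels, threshold=1.0):
    """Return kernels with intensity below the threshold, lowest first."""
    low = [k for k in kernels if intensity(k) < threshold]
    return sorted(low, key=intensity)


# Made-up workloads over 8-byte elements.
n = 1_000_000   # vector length
m = 1_000       # matrix dimension
kernels = [
    Kernel("vector_sum", ops=n, bytes_moved=8 * n),               # one add per element read
    Kernel("scatter_update", ops=n, bytes_moved=16 * n),          # read and write per update
    Kernel("dense_matmul", ops=2 * m**3, bytes_moved=24 * m**2),  # compute heavy
]

for k in pim_candidates(kernels):
    print(f"{k.name}: {intensity(k):.4f} ops/byte")
```

Under these assumptions the streaming and scatter kernels are flagged, while the dense matrix multiply, which reuses each byte many times, is not.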

Focus: Non-Volatile Memory

State-of-the-art NAND flash chips have been designed to maximize capacity at the cost of performance. Until fairly recently, the speed of a random read to NAND flash was not a concern to flash manufacturers because the SPI interface on most flash cards was the primary bottleneck. Furthermore, the mobile and embedded market was more concerned with capacity than with speed. However, the widespread adoption of high performance enterprise SSDs has introduced flash to a new role in which it is limited by its read latency. The internal organization and peripheral circuitry of flash chips could potentially be redesigned to reverse the trend of trading latency for capacity, giving up some capacity in exchange for read latency improvements of as much as 6-7x.

Collaboration

Exploring these techniques and developing this infrastructure requires considerable depth and breadth of expertise. To meet this challenge, we have assembled a team of laboratory, academic, and industry partners. This mix will ensure the project is HPC-focused and cutting edge, and that it produces industrial-quality results.

Infrastructure

All the participants in this project are committed to the development of an open framework for high performance computing architectural simulators based on the Structural Simulation Toolkit. Simulator modules from each of the collaborators are expected to be interoperable under this framework. This integrated capability will be available to the high performance computing community as open source under an open license.

Key Papers

Keren Bergman, Gilbert Hendry, Paul Hargrove, John Shalf, Bruce Jacob, K. Scott Hemmert, Arun Rodrigues, David Resnick, Let there be light!: the future of memory systems is photonics and 3D stacking. MSPC 2011: 43-48

S. Rumley, D. Nikolova, R. Hendry, Q. Li, D. Calhoun, K. Bergman, "Silicon Photonics for Exascale Systems," Journal of Lightwave Technology 33 (4) (Oct 2014).

"Flexible auto-refresh: Enabling scalable and energy-efficient DRAM refresh reductions." Ishwar Bhati, Zeshan Chishti, Shih-Lien Lu, and Bruce Jacob. Proc. 42nd International Symposium on Computer Architecture (ISCA 2015) Portland OR, June 2015. (to appear)

"DRAM Refresh Mechanisms, Penalties, and Trade-Offs." Ishwar Bhati, Mu-Tien Chang, Zeshan Chishti, Shih-Lien Lu, and Bruce Jacob. IEEE Transactions on Computers, vol. 64 (to appear)