DynAX Project
Dynamically Adaptive X-Stack (DynAX) is a team led by ET International conducting research on runtime software for exascale computing.
Moving forward, exascale software will be unable to rely on minimally invasive system interfaces to provide an execution environment. Instead, a software runtime layer is necessary to mediate between an application and the underlying hardware and software. This proposal describes a model of execution based on codelets: small pieces of work that are sequenced by expressing their interdependencies to the runtime software, rather than by the implicit sequencing of a software thread. It also describes the interactions between the runtime layer, the compiler, and the programming language.
The runtime software for exascale computing must be able to manage a very large amount of outstanding work at any given time, along with enormous amounts of data, some of which may be highly volatile. The relationship between work and the data it acts upon or generates is crucial to maintaining high performance and low power usage: a poor understanding of data locality can lead to far more communication, which is extremely undesirable in an exascale environment. To associate work with data, and to facilitate migrating work to data and vice versa, such a runtime may impose a hierarchy on regions of the system, dividing it along address-space and privacy boundaries so that it can estimate the cost of communicating between regions. Tying data and work to locations in this hierarchy also creates a construct for transparent work stealing and sharing, helping to keep stolen work near its data and allowing shared work to be issued to specific regions.
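The following is a minimal sketch of what such a codelet interface might look like, written as a small sequential C++ program. The names (Codelet, Region, depends_on) and the toy scheduler are illustrative assumptions, not the project's actual runtime API; the point is that work is ordered by explicitly declared dependencies rather than by thread sequencing, and each codelet carries a locality hint a runtime could use to keep it near its data.

 // Hypothetical codelet-style interface: names and scheduler are illustrative only.
 #include <cstdio>
 #include <functional>
 #include <queue>
 #include <vector>

 struct Region { int level = 0; int index = 0; };   // position in the locality hierarchy

 struct Codelet {
     std::function<void()> body;                    // small, non-blocking unit of work
     Region hint;                                   // where its data is expected to live
     int unmet = 0;                                 // unsatisfied predecessor count
     std::vector<Codelet*> successors;              // codelets enabled when this one finishes
 };

 // b may run only after a has completed.
 void depends_on(Codelet& a, Codelet& b) { a.successors.push_back(&b); ++b.unmet; }

 // Toy sequential scheduler: a real runtime would distribute the ready work
 // across the region hierarchy and steal/share it subject to the hints.
 void run(std::vector<Codelet*> ready) {
     std::queue<Codelet*> q;
     for (Codelet* c : ready) q.push(c);
     while (!q.empty()) {
         Codelet* c = q.front(); q.pop();
         c->body();
         for (Codelet* s : c->successors)
             if (--s->unmet == 0) q.push(s);        // dependency satisfied: s becomes ready
     }
 }

 int main() {
     double block = 0.0;
     Codelet produce{[&]{ block = 42.0; }, {1, 0}};
     Codelet consume{[&]{ std::printf("%.1f\n", block); }, {1, 0}};
     depends_on(produce, consume);                  // ordering comes from the graph, not a thread
     run({&produce});
     return 0;
 }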
Compilers also need to reflect the requirements of exascale computing systems. A compiler that supports a codelet execution model must be able to determine appropriate boundaries for codelets in software and generate the codelets and code to interface with both the runtime and a transformed version of the input program. We propose that a three-step compilation process be used, wherein program code is compiled down to a high-level-language-independent internal representation, which can then be compiled down to C code that makes API calls into runtime software. This C code can then be compiled down to a platform-specific binary for execution on the target system, using existing C compilers for the generated sequential code. Higher-level analysis of the relationship between codelets and data can be performed in earlier steps, and this can enable the compiler to emit static hints to the runtime to assist in making decisions on scheduling and placement. Compilers can also assist in providing for fault tolerance by supporting containment domains, which can be used by the runtime software to assist in program checkpointing.
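To make the middle step concrete, here is a hedged sketch of the kind of generated code one might expect for a small loop: the loop body is outlined into a codelet that receives an explicit frame describing the data it touches, and the driver makes calls into a runtime API. The frame layout, the rt_spawn entry point, and the slicing are illustrative assumptions, not the actual output of the SCALE toolchain; the snippet is written in C style and also compiles as C++.

 /* Hypothetical illustration of the middle compilation step: the kind of C a
  * codelet compiler might emit for "for (i = 0; i < n; i++) y[i] = a * x[i];".
  * The rt_* entry point is a stand-in for a real runtime API, stubbed here so
  * the example runs sequentially. */
 #include <stdio.h>

 typedef struct {            /* frame: the data one codelet instance acts on */
     double a;
     const double* x;
     double* y;
     int lo, hi;             /* the iteration slice assigned to this codelet */
 } axpy_frame_t;

 static void axpy_codelet(void* arg) {              /* outlined loop body */
     axpy_frame_t* f = (axpy_frame_t*)arg;
     for (int i = f->lo; i < f->hi; i++)
         f->y[i] = f->a * f->x[i];
 }

 /* Stub "runtime": a real one would defer execution and place each codelet
  * near its data slice (so the frame would need to be heap-allocated). */
 static void rt_spawn(void (*fn)(void*), void* frame) { fn(frame); }

 void axpy(double a, const double* x, double* y, int n) {
     int chunk = 4;                                  /* illustrative slice size */
     for (int lo = 0; lo < n; lo += chunk) {
         int hi = lo + chunk < n ? lo + chunk : n;
         axpy_frame_t f = { a, x, y, lo, hi };
         rt_spawn(axpy_codelet, &f);                 /* one codelet per slice */
     }
 }

 int main(void) {
     double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8];
     axpy(2.0, x, y, 8);
     printf("%g %g\n", y[0], y[7]);                  /* expect 2 16 */
     return 0;
 }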
This work will be done in the context of DOE co-design applications. We will use kernels of these applications as well as other benchmarks and synthetic kernels in the course of our research. The needs of the co-design applications will provide valuable feedback to the research process.
R-Stream
- Current capabilities:
  - Automatic parallelization and mapping (a rough illustration follows this list)
  - Heterogeneous, hierarchical targets
  - Automatic DMA/communication generation and optimization
  - Auto-tuning of tile sizes, mapping strategies, etc.
  - Scheduling with parallelism/locality/layout tradeoffs
  - Corrective array expansion
- Planned capabilities:
  - Extended explicit data placement
  - Generation of parallel codelet code from serial code
  - Generation of SCALE IR and tuning hints on scheduling and data placement
  - Automatic mapping of irregular mesh codes
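As a rough illustration of automatic parallelization and tiling (illustrative only; the code R-Stream actually generates is mapper- and target-specific), a dense loop nest such as a matrix product might be restructured along these lines, with the tile size T left as the kind of parameter an auto-tuner would search over and OpenMP standing in for whatever parallel construct the mapping targets:

 // Illustrative only: a hand-written tiled, parallel form of C = A*B, of the
 // general shape a polyhedral mapper might produce. Assumes C starts zeroed.
 #include <cstdio>
 #include <vector>

 void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                   std::vector<double>& C, int n, int T /* tunable tile size */) {
     #pragma omp parallel for collapse(2)           // parallelism over disjoint C tiles
     for (int ii = 0; ii < n; ii += T)
         for (int jj = 0; jj < n; jj += T)
             for (int kk = 0; kk < n; kk += T)      // reuse A/B tiles while they are local
                 for (int i = ii; i < ii + T && i < n; i++)
                     for (int j = jj; j < jj + T && j < n; j++)
                         for (int k = kk; k < kk + T && k < n; k++)
                             C[i*n + j] += A[i*n + k] * B[k*n + j];
 }

 int main() {
     int n = 64, T = 16;
     std::vector<double> A(n*n, 1.0), B(n*n, 1.0), C(n*n, 0.0);
     matmul_tiled(A, B, C, n, T);
     std::printf("%g\n", C[0]);                     // expect 64
     return 0;
 }

Tiling trades a little scheduling bookkeeping for reuse of the A and B tiles while they are resident near the core, which is exactly the parallelism/locality/layout tradeoff listed above.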
Hierarchical Tiled Arrays
- HTAs are recursive data structures
- Tree-structured representation of memory
- Includes a library of operations that enables programming codelets in the familiar notation of C/C++ (see the sketch after this list)
- Represent parallelism using operations on arrays and sets
- Represent parallelism using parallel constructs such as parallel loops
- Compiler optimizations on sequences of HTA operations will be evaluated
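A hedged sketch of the idea in C++ follows; the types and operations below are illustrative and are not the HTA library's actual interface. An array is represented as a tree whose leaves hold tiles of data, and whole-array operations express the available parallelism tile by tile.

 // Illustrative sketch of the HTA idea, not the real HTA library API:
 // a tree-structured array whose leaves hold data and whose whole-array
 // operations could be applied to tiles in parallel (e.g. as codelets).
 #include <cstdio>
 #include <functional>
 #include <vector>

 struct HTA {
     std::vector<HTA> tiles;        // non-empty => an inner node of the tree
     std::vector<double> data;      // non-empty => a leaf tile holding elements

     // Apply f to every element; each leaf tile is an independent unit of work
     // that a runtime could schedule near the tile's data.
     void map(const std::function<void(double&)>& f) {
         for (HTA& t : tiles) t.map(f);
         for (double& x : data) f(x);
     }
 };

 // Build a two-level HTA: `outer` tiles, each a leaf of `inner` elements.
 HTA make_hta(int outer, int inner, double init) {
     HTA h;
     for (int i = 0; i < outer; i++)
         h.tiles.push_back(HTA{{}, std::vector<double>(inner, init)});
     return h;
 }

 int main() {
     HTA a = make_hta(4, 256, 1.0);
     a.map([](double& x) { x *= 2.0; });            // whole-array operation, tile by tile
     std::printf("%g\n", a.tiles[0].data[0]);       // expect 2
     return 0;
 }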
Rescinded Primitive Data Type Access
- Redundancy removal to improve performance/energy
  - Communication
  - Storage
- Redundancy addition to improve fault tolerance
  - High-level fault-tolerant error correction codes and their distributed placement
- Placeholder representation for aggregated data elements (one possible reading is sketched after this list)
- Memory allocation/deallocation/copying
- Memory consistency models
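One possible reading of the placeholder bullet, purely as an illustration and not the project's actual design: a lightweight descriptor stands in for an aggregated block of elements and defers allocation and copying until the data is actually touched, so aggregates that are passed around but never read incur no redundant storage or communication.

 // Illustrative sketch only: a placeholder for an aggregated block of elements
 // that defers allocation/copying until first real access.
 #include <cstddef>
 #include <cstdio>
 #include <vector>

 class Placeholder {
     const double* source;          // where the aggregate currently lives
     std::size_t   count;
     std::vector<double> local;     // materialized copy, empty until needed
 public:
     Placeholder(const double* src, std::size_t n) : source(src), count(n) {}

     // Materialize on first element access.
     double at(std::size_t i) {
         if (local.empty()) local.assign(source, source + count);
         return local[i];
     }
 };

 int main() {
     std::vector<double> big(1 << 20, 3.0);
     Placeholder p(big.data(), big.size());         // cheap: no copy yet
     std::printf("%g\n", p.at(7));                  // copy happens here, on demand
     return 0;
 }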
NWChem
- DOE’s premier computational chemistry software
- A one-of-a-kind solution that scales with both the scientific challenge and the compute platform
  - From molecules and nanoparticles to solid-state and biomolecular systems
- Open-source licensing (ECL 2.0) has greatly expanded the user and developer base
  - Worldwide distribution (70% academia)
- Ab initio molecular dynamics runs at petascale
  - Scalability to 100,000 processors demonstrated
- Smart data distribution and communication algorithms enable hybrid DFT to scale to large numbers of processors
Deliverables
Q1 (12/1/2012)
Q2 (3/1/2013)
- March 2013 PI Meeting: PI Meeting presentation
- EXaCT all-hands meeting: DynAX presentation
Q3 (6/1/2013)
- Q3 Report
- NWChem SCF code download version 2
- PIL Design v0.4
- Tensor Contraction Engine (OpenMP, CUDA, C, Fortran)
- Year 1 report: Y1 Report
Q4 (9/1/2013)
Q5 (12/1/2013)
Q6 (3/1/2014)
- Year 2 report: Y2 Report