Applications
From Modelado Foundation
The propose of this page is to gather user applications that serve as poster children for HHAT.
A link to get back up to the parent page is here.
Please this this approach
- Create a new subsection for each application, with two equal signs and a space around the title of each app
- Include the content in the template below
CORAL apps
Collaboration of Oak Ridge, Argonne and Livermore
APEX apps
Alliance for Application Performance at Extreme Scale
ECP apps
Exascale computing project
PASC apps
Platform for Advanced Scientific Computing, Switzerland
Shortly there will be more information about this initiative, but generally speaking Switzerland is founding development projects for libraries and applications with strong performance-portable code in areas like weather and climate, material science and molecular dynamics. Many computational kernels there are based on linear algebra, some other are based on finite differences.
GridTools
One of the projects is producing a set of (C++ header-based) libraries for finite differences in weather and climate applications. The main idea is to allow for the numeric operators to be expressed in a possibly grid-agnostic way, while the grid, wether representing a local region or the whole globe, is plugged in a second step. The libraries provide means of composition for the different operators , so that to allow to increase the computation-intensity of otherwise memory bound stencils, allowing specification of boundary conditions in a very flexible way, perform nearest neighbor communication operations, and domain decomposition. The central component of the set of libraries is the composition of different operators. All of the libraries, however, have backends to execute the requested tasks on specific architectures. Currently supported are x86-based multicores and nVidia GPUs. Xeon Phi is in a early stage of implementation. A plan to orchestrate the different activities (stencil execution, boundary conditions, communications) using some for of dynamic scheduling is one of the goals we are pursuing. Employing a more dynamic execution policy for each computational phase is not currently considered a urgent matter, since the scheduling of the operations is basically known at compile time. Future directions may include adopting a more dynamic approach in both high- and low-levels if such an integration is beneficial for performance.
ISV apps
Sandia's Task-DAG R&D 2014-2016
Sandia conducted a three year laboratory directed research and development (LDRD) effort to explore on-node, performance portable directed acyclic graph (DAG) of tasks parallel pattern, usage algorithms, application programmer interface, scheduling algorithms, and implementations. Of significance this LDRD used C++ meta-programming to achieve performance portability across CPU and NVIDIA GPU (CUDA) architectures. The above document is the final report for this R&D.
The prototype developed through this LDRD is currently (2017) being matured (overhauled) to address performance issues and elevate to production quality. This effort is scheduled for delivery within Kokkos by September 2017.
TRALEIKA GLACIER X-STACK Project
The XStack Traleika Glacier (XSTG) project was a three-year research award for exploring a revolutionary exascaleclass machine software framework. The XSTG program, including Intel, UC San Diego, Pacific Northwest National Lab, UIUC, Rice University, Reservoir Labs, ET International, and U. Delaware, had major accomplishments, insights, and products resulting from this three-year effort.
Its technical artifacts were primarily 1) a novel hardware architecture (Traleika Glacier) and a simulator for this architecture, 2) a specification of a DAG parallel, asynchronous tasking, low-level runtime called the Open Community Runtime (OCR), 3) several implementations of OCR including a reference implementation and the PNNL optimized OCR implementation (P-OCR), 4) the layering of several higher level programming models on top of OCR including CnC, HClib, and HTA, and 5) implementation of several DoE mini-apps and other applications on top of OCR or the higher level programming models.
Apps included:
- Smith Waterman
- Cholesky decomposition
- Two NWChem kernels (Self-Consistent Field and Coupled Cluster methods)
- CoMD
- HPGMG
Habanero Tasking Micro-Benchmark Suite
This micro-benchmarking suite is a work-in-progress intended to compare low-level overheads across low-level tasking runtimes (e.g. Realm, OCR). The above Github page includes a high-level description of each micro-benchmark, as well as source code for each micro-benchmark across a variety of low-level runtimes. These micro-benchmarks were curated across performance regression suites from a variety of tasking runtimes, and so is intended to enable one-to-one comparison of runtime efficiencies (as much as possible).
Categories of Hierarchical Algorithms
David Keyes volunteered to offer several such categories.
Charm++
Charm++ is a task-based asynchronous parallel runtime system used in applications such as ChaNGa and NAMD, among others. The runtime system has mechanisms to control the placement, scheduling, and execution target of the program, all of which could use HHAT components when available. A key property is that these properties are generally decided dynamically during execution by the runtime system, and retaining this dynamism is critical for the ideology of the software. Of particular interest is infrastructure for managing heterogeneous execution, data location, network communication, and (user and system level) threading.
VMD
VMD is a tool for preparing, analyzing, and visualizing molecular dynamics simulations and lattice cell simulations.
VMD has a large user community with over 100,000 registered users, and it is used on a broad range of hardware and operating system platforms that covers tablets, conventional laptops, PCs, as well as large clusters and supercomputers, and systems with and without GPU accelerators, recent many-core CPUs with wide vector units and some that have extensive hardware SMT. On clusters and supercomputers, VMD uses MPI for distributed memory execution. VMD implements a diverse range of internal algorithms and makes increasing use of third party software components and external libraries, particularly those that offer high performance platform-optimized algorithms, e.g., for linear algebra, FFTs, sequence alignment (bioinformatics), and others. VMD currently makes use of a light weight tasking system that originated in the Tachyon ray tracing library (that VMD also uses), which provides cross-platform APIs for threading, synchronization, and tasking. The tasking implementation in VMD and Tachyon incorporates special features to allow consideration of jobs that span both CPUs and GPU accelerators, handle unexpected runtime errors, and allow retry or relocation of task indices that contained work that could not be executed by an accelerator and migrates them to host CPUs, etc.
It should be a short-term effort (under a week?) to reimplement the key parts of the VMD/Tachyon tasking abstractions on top of a HiHat implementation, which would enable performance and usability comparisons to be made even with early HiHat implementations.
VMD HiHat Requirements and User Story
Library solution
A library based approach is strongly preferred because VMD already incorporates a multiplicity of compilers for its various component software. At typical x86 Linux build of VMD is compiled with Intel C/C++, NVIDIA NVCC, Intel ISPC, and GNU C/C++, and final linkage is done with Intel C/C++. It is already a challenge to manage complex compiler, OS, and library compatibility issues with this present arrangement, and adding the need to use yet another compiler or compiler/language extensions would likely complicate this even further.
Platform support
Since VMD runs on a very wide range of hardware and OS combinations that range from tablets, laptops, desktops all the way to clusters and supercomputers, HiHat must by definition be usable across all of those platforms, although the various criteria for optimal user experience, performance, energy efficiency, etc are different across these platforms.
Language binding
C binding might be preferable. C++ APIs and objects have a tendency to become "viral", potentially affecting far reaching parts of application code unless the application uses its own abstraction layer to limit this. That said, I could envision some nice benefits from C++ bindings if they allowed improved interoperability between HiHat and current and future C++ language features, and the like.
Performance criteria
I prefer most overheads to be shifted toward application startup, particularly when it comes to enumerating NUMA hardware topology, accelerators, and building up persistent hardware thread pools, CUDA streams, and other resources that are required later on in the midst of performance critical task scheduling activities.
Some CPU algorithms achieve peak performance with a mapping of tasks indices to a number of hardware threads that matches the number of physical CPU cores, that is, without any use of additional SMT threads. These algorithms are typically very arithmetic intensive, and therefore the use of SMT is unnecessary to hide memory latencies, and in fact use of SMT may slow performs. On some CPUs it can be the case that use of SMT comes at a cost in terms of register spills or other scarce on-chip resources, so the calling application needs to have some control over whether and to what degree SMT is used to schedule tasks (e.g. what SMT depth is used).
VMD algorithms often make use of finite capacity on-chip or on-board memory systems e.g. KNL MCDRAM, GPU HBM, constant/texture caches, etc. When scheduling multiple independent queues onto the available processors (CPUs, GPUs, ...) it should be possible for the application to exploit innate deep knowledge of required working set size within these finite memory systems to inform HiHat scheduling so that runtime errors don't occur, and performance is maximized.
Handling of exceptions, runtime errors, underlying hardware errors/failures
There are a variety of scenarios where it is best for performance to design algorithms for a GPU or many-core wide-vector CPUs with a "optimistic and greedy" approach, that is they schedule all work units on target devices with the assumption that exceptions or problematic work units are extremely rare. In such cases one can often achieve higher performance by launching the work optimistically, and having a CPU "cleanup thread" deal with work units that encountered exceptional arithmetic conditions, resource limitations, or other cases where special treatment might be required that is not well suited to the kernel implementation running on the GPU. In such cases the cost of cleaning up the rare exceptions is usually lower than if the CPU investigated the work units and rearranged or compacted the task set submitted to the GPU to explicitly avoid the problematic ones.
Thinking in terms of future hardware and robust execution in the face of hardware errors, it should perhaps be possible to handle transient hardware faults or resource exaustion scenarios at some level of granularity, perhaps allowing affected tasks to be rescheduled off of an accelerator such as a GPU and onto the available host CPUs, or a different subset of GPUs that either have more resources, or similar.
Add more
Template
Application 1
- Brief description of app and its business importance
- Brief description of app domain
- Qualitative or quantitative analysis of where and how it would benefit from HHAT
- Expected time table for delivery of a solution (e.g. readiness for the arrival of a new supercomputer at a USG lab), and resources available to implement it with HHAT
- purpose: identify apps that could lead vehicles that drive the development of an open source project and that would be a poster child that would build confidence for others to follow