Revision as of 11:52, March 21, 2017

The purpose of this page is to gather user applications that serve as poster children for HHAT.

A link to get back up to the parent page is here.

Please use this approach:

Create a new subsection for each application, with two equal signs and a space around the title of each app
Include the content in the template below

CORAL apps

Collaboration of Oak Ridge, Argonne and Livermore

APEX apps

Alliance for Application Performance at Extreme Scale

ECP apps

Exascale Computing Project

PASC apps

Platform for Advanced Scientific Computing, Switzerland

More information about this initiative will be available shortly. In general, Switzerland is funding development projects for libraries and applications with strong performance-portable code in areas such as weather and climate, materials science, and molecular dynamics. Many of the computational kernels are based on linear algebra; others are based on finite differences.

GridTools

One of the projects is producing a set of C++ header-based libraries for finite differences in weather and climate applications. The main idea is to let the numerical operators be expressed in a grid-agnostic way where possible, while the grid, whether it represents a local region or the whole globe, is plugged in as a second step. The libraries provide means to compose different operators, which increases the computational intensity of otherwise memory-bound stencils; to specify boundary conditions in a very flexible way; and to perform nearest-neighbor communication and domain decomposition. The central component of the set of libraries is the composition of different operators. All of the libraries have backends that execute the requested tasks on specific architectures. x86-based multicores and NVIDIA GPUs are currently supported, and Xeon Phi support is at an early stage of implementation. One of the goals we are pursuing is a plan to orchestrate the different activities (stencil execution, boundary conditions, communication) using some form of dynamic scheduling. Employing a more dynamic execution policy within each computational phase is not currently considered urgent, since the schedule of operations is essentially known at compile time. Future directions may include adopting a more dynamic approach at both high and low levels if such an integration benefits performance.
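The effect of operator composition on computational intensity can be illustrated with a small sketch. This is not GridTools code and does not use its API; the function names `laplacian`, `two_pass`, and `fused` are hypothetical, and a 1D grid is assumed for brevity. The two-pass version writes an intermediate field to memory and reads it back, while the fused version recomputes the intermediate values on the fly in a single sweep.

```python
def laplacian(u, i):
    """Three-point 1D Laplacian of field u at interior index i."""
    return u[i - 1] - 2.0 * u[i] + u[i + 1]


def two_pass(u):
    """Apply the Laplacian twice, storing the intermediate field."""
    n = len(u)
    lap = [0.0] * n
    for i in range(1, n - 1):
        lap[i] = laplacian(u, i)
    out = [0.0] * n
    for i in range(2, n - 2):
        out[i] = laplacian(lap, i)
    return out


def fused(u):
    """Composed operator: one sweep, no intermediate field in memory.

    Each output point recomputes the three intermediate values it needs,
    trading extra arithmetic for less memory traffic.
    """
    n = len(u)
    out = [0.0] * n
    for i in range(2, n - 2):
        left = laplacian(u, i - 1)
        mid = laplacian(u, i)
        right = laplacian(u, i + 1)
        out[i] = left - 2.0 * mid + right
    return out


# Illustrative input: u[i] = i**4, whose fourth difference is constant.
u = [float(i ** 4) for i in range(10)]
same = fused(u) == two_pass(u)
```

Both versions produce the same result; the fused one performs more floating-point operations per value loaded, which is the trade that helps a memory-bound stencil.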

ISV apps

Sandia's Task-DAG R&D 2014-2016

Sandia's Task-DAG LDRD report

Sandia conducted a three-year laboratory directed research and development (LDRD) effort to explore an on-node, performance-portable parallel pattern for directed acyclic graphs (DAGs) of tasks, covering usage algorithms, the application programmer interface, scheduling algorithms, and implementations. Significantly, this LDRD used C++ meta-programming to achieve performance portability across CPU and NVIDIA GPU (CUDA) architectures. The above document is the final report for this R&D.
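As a rough illustration of the task-DAG pattern, the sketch below runs each task only after all of its predecessors have completed. It is a sequential toy under stated assumptions, not the Sandia prototype or the Kokkos API; `run_task_dag`, `tasks`, and `deps` are hypothetical names introduced here.

```python
from collections import deque


def run_task_dag(tasks, deps):
    """Run every task once, only after all of its predecessors have run.

    tasks: dict mapping a task name to a zero-argument callable.
    deps:  dict mapping a task name to the names it depends on.
    Returns the order in which the tasks were executed.
    """
    # Number of unfinished predecessors for each task.
    pending = {name: len(deps.get(name, ())) for name in tasks}
    successors = {name: [] for name in tasks}
    for name, preds in deps.items():
        for pred in preds:
            successors[pred].append(name)

    # Tasks with no predecessors are ready immediately.
    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()
        order.append(name)
        # Finishing a task may release its successors.
        for succ in successors[name]:
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(succ)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order
```

A real on-node scheduler would hand ready tasks to multiple workers instead of running them one at a time; the ready queue is the point where such a policy would plug in.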

The prototype developed through this LDRD is currently (2017) being matured (overhauled) to address performance issues and to elevate it to production quality. This effort is scheduled for delivery within Kokkos by September 2017.

TRALEIKA GLACIER X-STACK Project

Final Technical Report

The XStack Traleika Glacier (XSTG) project was a three-year research award for exploring a revolutionary exascale-class machine software framework. The XSTG program, including Intel, UC San Diego, Pacific Northwest National Lab, UIUC, Rice University, Reservoir Labs, ET International, and U. Delaware, had major accomplishments, insights, and products resulting from this three-year effort.

Its technical artifacts were primarily 1) a novel hardware architecture (Traleika Glacier) and a simulator for this architecture, 2) a specification of a DAG parallel, asynchronous tasking, low-level runtime called the Open Community Runtime (OCR), 3) several implementations of OCR including a reference implementation and the PNNL optimized OCR implementation (P-OCR), 4) the layering of several higher level programming models on top of OCR including CnC, HClib, and HTA, and 5) implementation of several DoE mini-apps and other applications on top of OCR or the higher level programming models.

Apps included:

Smith-Waterman
Cholesky decomposition
Two NWChem kernels (Self-Consistent Field and Coupled Cluster methods)
CoMD
HPGMG

Habanero Tasking Micro-Benchmark Suite

GitHub page

This micro-benchmarking suite is a work in progress intended to compare low-level overheads across low-level tasking runtimes (e.g. Realm, OCR). The above GitHub page includes a high-level description of each micro-benchmark, as well as source code for each micro-benchmark across a variety of low-level runtimes. These micro-benchmarks were curated from the performance regression suites of a variety of tasking runtimes, and so are intended to enable one-to-one comparison of runtime efficiencies (as much as possible).

Categories of Hierarchical Algorithms

David Keyes volunteered to offer several such categories.

Charm++

Charm++ is a task-based asynchronous parallel runtime system used in applications such as ChaNGa and NAMD, among others. The runtime system has mechanisms to control the placement, scheduling, and execution target of the program, all of which could use HHAT components when available. A key property is that these choices are generally made dynamically during execution by the runtime system, and retaining this dynamism is critical to the philosophy of the software. Of particular interest is infrastructure for managing heterogeneous execution, data location, network communication, and (user- and system-level) threading.

VMD

VMD is a tool for preparing, analyzing, and visualizing molecular dynamics simulations and lattice cell simulations.

VMD has a large user community with over 100,000 registered users, and it is used on a broad range of hardware and operating system platforms, from tablets, conventional laptops, and PCs to large clusters and supercomputers, including systems with and without GPU accelerators, recent many-core CPUs with wide vector units, and CPUs with extensive hardware SMT. On clusters and supercomputers, VMD uses MPI for distributed memory execution. VMD implements a diverse range of internal algorithms and makes increasing use of third-party software components and external libraries, particularly those that offer high-performance, platform-optimized algorithms, e.g., for linear algebra, FFTs, and sequence alignment (bioinformatics). VMD currently makes use of a lightweight tasking system that originated in the Tachyon ray tracing library (which VMD also uses) and provides cross-platform APIs for threading, synchronization, and tasking. The tasking implementation in VMD and Tachyon incorporates special features, for example to allow jobs that span both CPUs and GPU accelerators, to handle unexpected runtime errors, and to retry or relocate task indices whose work could not be executed by an accelerator by migrating them to host CPUs.
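The retry-and-relocate behavior described above can be sketched in a few lines. This is an illustrative, sequential toy and not the VMD or Tachyon API; `run_with_cpu_fallback`, `gpu_kernel`, and `cpu_kernel` are hypothetical names, and an accelerator failure is assumed to surface as a `RuntimeError`.

```python
def run_with_cpu_fallback(task_indices, gpu_kernel, cpu_kernel):
    """Try each task index on the accelerator; migrate failures to the CPU.

    gpu_kernel and cpu_kernel are callables taking a task index and
    returning its result. Returns (results, migrated), where migrated
    lists the indices that had to be relocated to the host.
    """
    results = {}
    migrated = []
    for idx in task_indices:
        try:
            results[idx] = gpu_kernel(idx)
        except RuntimeError:
            # The accelerator could not execute this index; defer it.
            migrated.append(idx)
    # Relocate the failed indices to the host CPU.
    for idx in migrated:
        results[idx] = cpu_kernel(idx)
    return results, migrated
```

In a real tasking system the accelerator and host workers would pull indices concurrently from a shared pool; the sketch only shows the bookkeeping that lets failed indices be completed elsewhere.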

It should be a short-term effort (under a week?) to reimplement the key parts of the VMD/Tachyon tasking abstractions on top of a HiHat implementation, which would enable performance and usability comparisons to be made even with early HiHat implementations.

Add more

Template

Application 1

Brief description of app and its business importance
Brief description of app domain
Qualitative or quantitative analysis of where and how it would benefit from HHAT
Expected time table for delivery of a solution (e.g. readiness for the arrival of a new supercomputer at a USG lab), and resources available to implement it with HHAT
Purpose: identify apps that could be lead vehicles that drive the development of an open source project and serve as poster children that build confidence for others to follow


From Modelado Foundation