Applications: Difference between revisions
From Modelado Foundation
imported>Cnewburn (→VMD) |
imported>Cnewburn |
||
(2 intermediate revisions by one other user not shown) | |||
Line 54: | Line 54: | ||
= Categories of Hierarchical Algorithms = | = Categories of Hierarchical Algorithms = | ||
David Keyes | David Keyes, Kaust | ||
Computing at exascale represents a massive investment, and it makes little economic sense to approach exascale with algorithms that are not optimal in their data access. Exascale computers, like today’s petascale systems, will consist of many thousands of nodes, each of which has a fixed ratio of on-node storage to on-node bandwidth and processing capability. If the complexity of the application rises as a higher power than the first of problem size, that is, faster than linearly, then weak scaling will eventually fail, as the processing requirements will grow faster than the processing power. Fortunately, for many common classes of numerical computation, such optimal algorithms are known. They are typically log-lin or poly-log-lin in complexity; that is, for problem size N, the asymptotic complexity is O(N logp N). | |||
A breakthrough optimal numerical algorithm, shaving off what was previously a nonscaling power of N and replacing it with a constant or a logarithm of N has occurred approximately once per decade in the modern computing era. Indeed, the mathematical progress behind these algorithms was in each case explicitly motivated by a recognized need for higher performance. | |||
Some optimal numerical algorithms are: | |||
*Fast Fourier Transform (1960’s) | |||
*Multigrid (1970’s) | |||
*Fast Multipole (1980’s) | |||
*Sparse Grids (1990’s) | |||
*H matrices (2000’s) | |||
*Randomized algorithms (2010’s) | |||
A feature common to most optimal complexity algorithms, numerical and otherwise, is that they are hierarchical. They generally exploit a multiplicity of representations for all or parts of the primary data structures, with a hierarchical range of resolutions, and they exchange data among the processing elements over a hierarchical range of scales. Another feature of these methods is that, upon examination, they offer opportunities for implementations that are less synchronous than the default implementations with which they were born and introduced. | |||
Whether these methods achieve their optimal complexity depends upon the mathematical structure of the particular system they are asked to solve. There exist problems with structure in the PDE coefficients, matrix elements, or component orderings in which the asymptotic complexity lies only far out (e.g., past exascale), or is achieved only with very large constants. For such problems, a non-optimal algorithm may be faster for a smaller N. However, as computers and what is expected of them continue to scale with human imagination and technological advances, optimal hierarchical algorithms will deserve and receive prime attention for further mathematical extension and efficient asynchronous computational implementation. | |||
Editor's requests for enhancement to this entry, for the community: Please cite specific examples, support their market significance | |||
= Charm++ = | = Charm++ = | ||
Line 141: | Line 159: | ||
a GPU and onto the available host CPUs, or a different subset of GPUs that | a GPU and onto the available host CPUs, or a different subset of GPUs that | ||
either have more resources, or similar. | either have more resources, or similar. | ||
= Transport Monte Carlo = | |||
Point of Contact: [mailto:richards12@llnl.gov David Richards] | |||
== Description == | |||
Transport Monte Carlo solves the problem of tracking particles through a | |||
material. Interactions between the particles and the background | |||
material are sampled randomly from experimentally determined | |||
cross-section functions that give probability distributions of various | |||
physical events such as scattering, absorption, etc. By tracking many | |||
particles (often millions or billions) it is possible to obtain a | |||
statistically correct solution to the transport problem. | |||
== Business Relevance == | |||
Transport codes are of fundamental interest to NNSA. | |||
== Target relevance == | |||
Problems can require 10 to 10,000 nodes or even more for hero runs. | |||
Codes typically employ spatial domain decompostion and sophisticated | |||
load balancing schemes to balance both internode and intranode work. | |||
When running on GPU/CPU nodes it is likely that both CPUs and GPUs will | |||
track particles. | |||
== Functionality of the code == | |||
* Bulk of work sits in a single loop over the particles. | |||
* What processing happens for each particle is based on random selection. | |||
* Some codes track particles thorugh all process until it "dies" (particle exits the simulation domain or is absorbed). Others track the particle to the end of a time step (usually multiple segements per time step). | |||
* Particles do move across domain boundaries. | |||
== Current code == | |||
* Main loop is a "while more work" loop. Particles may be added to the work queue during the loop due to either communication from other domains or particle creation processes such as fission. This contains all code to handle all possible particle processes so there are effectively many 10,000's of lines of code in this loop. | |||
* At top of loop body, check several conditions, e.g. are communication buffers are nearly full so I need to process them, or load balance on a multi-core architecture, or process the next particle. | |||
== Possible task-based structure == | |||
* New loop body tests conditions and either triggers events or invokes async tasks | |||
* Tasks could include classification of particles by next event, packing execution buffer, sending data, processing data, receiving data, ~ | |||
* Note that some events could be even based, such as a buffer becoming full. Async event handlers could potentially trigger such tasks. | |||
== Characterization == | |||
Total # of tasks: All particles are independent, and most other | |||
operations can be asynchronous so there are potentially 100,000's of | |||
tasks per node. However, a single particle is probably too small a | |||
workload for a single task so bundling multiple particles is likely. | |||
Approximate granularity of tasks: Unfortunately, the tracking task will | |||
likely have 10,000's of lines of code. Massive code refactoring would be | |||
needed to separate each possible particle event into a separate task and | |||
the resulting tasks would almost certainly be too fine grained for | |||
efficient execution. | |||
Range of amount of data needed per task: Particles are typically 100-200 | |||
bytes each. Some data from the back ground geometry or mesh will be | |||
needed. Individual mesh elments aren't big, but due to scatting it is | |||
difficult to predict which elements are needed. Cross section | |||
(probability distribution) data can range from 10's to 100's of | |||
kilobyles to several gigabytes depending on the method chosen to | |||
represent the data tables. | |||
= Add more = | = Add more = |
Latest revision as of 17:01, April 17, 2017
The propose of this page is to gather user applications that serve as poster children for HHAT.
A link to get back up to the parent page is here.
Please this this approach
- Create a new subsection for each application, with two equal signs and a space around the title of each app
- Include the content in the template below
CORAL apps
Collaboration of Oak Ridge, Argonne and Livermore
APEX apps
Alliance for Application Performance at Extreme Scale
ECP apps
Exascale computing project
PASC apps
Platform for Advanced Scientific Computing, Switzerland
Shortly there will be more information about this initiative, but generally speaking Switzerland is founding development projects for libraries and applications with strong performance-portable code in areas like weather and climate, material science and molecular dynamics. Many computational kernels there are based on linear algebra, some other are based on finite differences.
GridTools
One of the projects is producing a set of (C++ header-based) libraries for finite differences in weather and climate applications. The main idea is to allow for the numeric operators to be expressed in a possibly grid-agnostic way, while the grid, wether representing a local region or the whole globe, is plugged in a second step. The libraries provide means of composition for the different operators , so that to allow to increase the computation-intensity of otherwise memory bound stencils, allowing specification of boundary conditions in a very flexible way, perform nearest neighbor communication operations, and domain decomposition. The central component of the set of libraries is the composition of different operators. All of the libraries, however, have backends to execute the requested tasks on specific architectures. Currently supported are x86-based multicores and nVidia GPUs. Xeon Phi is in a early stage of implementation. A plan to orchestrate the different activities (stencil execution, boundary conditions, communications) using some for of dynamic scheduling is one of the goals we are pursuing. Employing a more dynamic execution policy for each computational phase is not currently considered a urgent matter, since the scheduling of the operations is basically known at compile time. Future directions may include adopting a more dynamic approach in both high- and low-levels if such an integration is beneficial for performance.
ISV apps
Sandia's Task-DAG R&D 2014-2016
Sandia conducted a three year laboratory directed research and development (LDRD) effort to explore on-node, performance portable directed acyclic graph (DAG) of tasks parallel pattern, usage algorithms, application programmer interface, scheduling algorithms, and implementations. Of significance this LDRD used C++ meta-programming to achieve performance portability across CPU and NVIDIA GPU (CUDA) architectures. The above document is the final report for this R&D.
The prototype developed through this LDRD is currently (2017) being matured (overhauled) to address performance issues and elevate to production quality. This effort is scheduled for delivery within Kokkos by September 2017.
TRALEIKA GLACIER X-STACK Project
The XStack Traleika Glacier (XSTG) project was a three-year research award for exploring a revolutionary exascaleclass machine software framework. The XSTG program, including Intel, UC San Diego, Pacific Northwest National Lab, UIUC, Rice University, Reservoir Labs, ET International, and U. Delaware, had major accomplishments, insights, and products resulting from this three-year effort.
Its technical artifacts were primarily 1) a novel hardware architecture (Traleika Glacier) and a simulator for this architecture, 2) a specification of a DAG parallel, asynchronous tasking, low-level runtime called the Open Community Runtime (OCR), 3) several implementations of OCR including a reference implementation and the PNNL optimized OCR implementation (P-OCR), 4) the layering of several higher level programming models on top of OCR including CnC, HClib, and HTA, and 5) implementation of several DoE mini-apps and other applications on top of OCR or the higher level programming models.
Apps included:
- Smith Waterman
- Cholesky decomposition
- Two NWChem kernels (Self-Consistent Field and Coupled Cluster methods)
- CoMD
- HPGMG
Habanero Tasking Micro-Benchmark Suite
This micro-benchmarking suite is a work-in-progress intended to compare low-level overheads across low-level tasking runtimes (e.g. Realm, OCR). The above Github page includes a high-level description of each micro-benchmark, as well as source code for each micro-benchmark across a variety of low-level runtimes. These micro-benchmarks were curated across performance regression suites from a variety of tasking runtimes, and so is intended to enable one-to-one comparison of runtime efficiencies (as much as possible).
Categories of Hierarchical Algorithms
David Keyes, Kaust
Computing at exascale represents a massive investment, and it makes little economic sense to approach exascale with algorithms that are not optimal in their data access. Exascale computers, like today’s petascale systems, will consist of many thousands of nodes, each of which has a fixed ratio of on-node storage to on-node bandwidth and processing capability. If the complexity of the application rises as a higher power than the first of problem size, that is, faster than linearly, then weak scaling will eventually fail, as the processing requirements will grow faster than the processing power. Fortunately, for many common classes of numerical computation, such optimal algorithms are known. They are typically log-lin or poly-log-lin in complexity; that is, for problem size N, the asymptotic complexity is O(N logp N).
A breakthrough optimal numerical algorithm, shaving off what was previously a nonscaling power of N and replacing it with a constant or a logarithm of N has occurred approximately once per decade in the modern computing era. Indeed, the mathematical progress behind these algorithms was in each case explicitly motivated by a recognized need for higher performance.
Some optimal numerical algorithms are:
- Fast Fourier Transform (1960’s)
- Multigrid (1970’s)
- Fast Multipole (1980’s)
- Sparse Grids (1990’s)
- H matrices (2000’s)
- Randomized algorithms (2010’s)
A feature common to most optimal complexity algorithms, numerical and otherwise, is that they are hierarchical. They generally exploit a multiplicity of representations for all or parts of the primary data structures, with a hierarchical range of resolutions, and they exchange data among the processing elements over a hierarchical range of scales. Another feature of these methods is that, upon examination, they offer opportunities for implementations that are less synchronous than the default implementations with which they were born and introduced.
Whether these methods achieve their optimal complexity depends upon the mathematical structure of the particular system they are asked to solve. There exist problems with structure in the PDE coefficients, matrix elements, or component orderings in which the asymptotic complexity lies only far out (e.g., past exascale), or is achieved only with very large constants. For such problems, a non-optimal algorithm may be faster for a smaller N. However, as computers and what is expected of them continue to scale with human imagination and technological advances, optimal hierarchical algorithms will deserve and receive prime attention for further mathematical extension and efficient asynchronous computational implementation.
Editor's requests for enhancement to this entry, for the community: Please cite specific examples, support their market significance
Charm++
Charm++ is a task-based asynchronous parallel runtime system used in applications such as ChaNGa and NAMD, among others. The runtime system has mechanisms to control the placement, scheduling, and execution target of the program, all of which could use HHAT components when available. A key property is that these properties are generally decided dynamically during execution by the runtime system, and retaining this dynamism is critical for the ideology of the software. Of particular interest is infrastructure for managing heterogeneous execution, data location, network communication, and (user and system level) threading.
VMD
VMD is a tool for preparing, analyzing, and visualizing molecular dynamics simulations and lattice cell simulations.
VMD has a large user community with over 100,000 registered users, and it is used on a broad range of hardware and operating system platforms that covers tablets, conventional laptops, PCs, as well as large clusters and supercomputers, and systems with and without GPU accelerators, recent many-core CPUs with wide vector units and some that have extensive hardware SMT. On clusters and supercomputers, VMD uses MPI for distributed memory execution. VMD implements a diverse range of internal algorithms and makes increasing use of third party software components and external libraries, particularly those that offer high performance platform-optimized algorithms, e.g., for linear algebra, FFTs, sequence alignment (bioinformatics), and others. VMD currently makes use of a light weight tasking system that originated in the Tachyon ray tracing library (that VMD also uses), which provides cross-platform APIs for threading, synchronization, and tasking. The tasking implementation in VMD and Tachyon incorporates special features to allow consideration of jobs that span both CPUs and GPU accelerators, handle unexpected runtime errors, and allow retry or relocation of task indices that contained work that could not be executed by an accelerator and migrates them to host CPUs, etc.
It should be a short-term effort (under a week?) to reimplement the key parts of the VMD/Tachyon tasking abstractions on top of a HiHat implementation, which would enable performance and usability comparisons to be made even with early HiHat implementations.
VMD HiHat Requirements and User Story
Library solution
A library based approach is strongly preferred because VMD already incorporates a multiplicity of compilers for its various component software. At typical x86 Linux build of VMD is compiled with Intel C/C++, NVIDIA NVCC, Intel ISPC, and GNU C/C++, and final linkage is done with Intel C/C++. It is already a challenge to manage complex compiler, OS, and library compatibility issues with this present arrangement, and adding the need to use yet another compiler or compiler/language extensions would likely complicate this even further.
Platform support
Since VMD runs on a very wide range of hardware and OS combinations that range from tablets, laptops, desktops all the way to clusters and supercomputers, HiHat must by definition be usable across all of those platforms, although the various criteria for optimal user experience, performance, energy efficiency, etc are different across these platforms.
Language binding
C binding might be preferable. C++ APIs and objects have a tendency to become "viral", potentially affecting far reaching parts of application code unless the application uses its own abstraction layer to limit this. That said, I could envision some nice benefits from C++ bindings if they allowed improved interoperability between HiHat and current and future C++ language features, and the like.
Performance criteria
I prefer most overheads to be shifted toward application startup, particularly when it comes to enumerating NUMA hardware topology, accelerators, and building up persistent hardware thread pools, CUDA streams, and other resources that are required later on in the midst of performance critical task scheduling activities.
Some CPU algorithms achieve peak performance with a mapping of tasks indices to a number of hardware threads that matches the number of physical CPU cores, that is, without any use of additional SMT threads. These algorithms are typically very arithmetic intensive, and therefore the use of SMT is unnecessary to hide memory latencies, and in fact use of SMT may slow performs. On some CPUs it can be the case that use of SMT comes at a cost in terms of register spills or other scarce on-chip resources, so the calling application needs to have some control over whether and to what degree SMT is used to schedule tasks (e.g. what SMT depth is used).
VMD algorithms often make use of finite capacity on-chip or on-board memory systems e.g. KNL MCDRAM, GPU HBM, constant/texture caches, etc. When scheduling multiple independent queues onto the available processors (CPUs, GPUs, ...) it should be possible for the application to exploit innate deep knowledge of required working set size within these finite memory systems to inform HiHat scheduling so that runtime errors don't occur, and performance is maximized.
Handling of exceptions, runtime errors, underlying hardware errors/failures
There are a variety of scenarios where it is best for performance to design algorithms for a GPU or many-core wide-vector CPUs with a "optimistic and greedy" approach, that is they schedule all work units on target devices with the assumption that exceptions or problematic work units are extremely rare. In such cases one can often achieve higher performance by launching the work optimistically, and having a CPU "cleanup thread" deal with work units that encountered exceptional arithmetic conditions, resource limitations, or other cases where special treatment might be required that is not well suited to the kernel implementation running on the GPU. In such cases the cost of cleaning up the rare exceptions is usually lower than if the CPU investigated the work units and rearranged or compacted the task set submitted to the GPU to explicitly avoid the problematic ones.
Thinking in terms of future hardware and robust execution in the face of hardware errors, it should perhaps be possible to handle transient hardware faults or resource exaustion scenarios at some level of granularity, perhaps allowing affected tasks to be rescheduled off of an accelerator such as a GPU and onto the available host CPUs, or a different subset of GPUs that either have more resources, or similar.
Transport Monte Carlo
Point of Contact: David Richards
Description
Transport Monte Carlo solves the problem of tracking particles through a material. Interactions between the particles and the background material are sampled randomly from experimentally determined cross-section functions that give probability distributions of various physical events such as scattering, absorption, etc. By tracking many particles (often millions or billions) it is possible to obtain a statistically correct solution to the transport problem.
Business Relevance
Transport codes are of fundamental interest to NNSA.
Target relevance
Problems can require 10 to 10,000 nodes or even more for hero runs. Codes typically employ spatial domain decompostion and sophisticated load balancing schemes to balance both internode and intranode work.
When running on GPU/CPU nodes it is likely that both CPUs and GPUs will track particles.
Functionality of the code
- Bulk of work sits in a single loop over the particles.
- What processing happens for each particle is based on random selection.
- Some codes track particles thorugh all process until it "dies" (particle exits the simulation domain or is absorbed). Others track the particle to the end of a time step (usually multiple segements per time step).
- Particles do move across domain boundaries.
Current code
- Main loop is a "while more work" loop. Particles may be added to the work queue during the loop due to either communication from other domains or particle creation processes such as fission. This contains all code to handle all possible particle processes so there are effectively many 10,000's of lines of code in this loop.
- At top of loop body, check several conditions, e.g. are communication buffers are nearly full so I need to process them, or load balance on a multi-core architecture, or process the next particle.
Possible task-based structure
- New loop body tests conditions and either triggers events or invokes async tasks
- Tasks could include classification of particles by next event, packing execution buffer, sending data, processing data, receiving data, ~
- Note that some events could be even based, such as a buffer becoming full. Async event handlers could potentially trigger such tasks.
Characterization
Total # of tasks: All particles are independent, and most other operations can be asynchronous so there are potentially 100,000's of tasks per node. However, a single particle is probably too small a workload for a single task so bundling multiple particles is likely.
Approximate granularity of tasks: Unfortunately, the tracking task will likely have 10,000's of lines of code. Massive code refactoring would be needed to separate each possible particle event into a separate task and the resulting tasks would almost certainly be too fine grained for efficient execution.
Range of amount of data needed per task: Particles are typically 100-200 bytes each. Some data from the back ground geometry or mesh will be needed. Individual mesh elments aren't big, but due to scatting it is difficult to predict which elements are needed. Cross section (probability distribution) data can range from 10's to 100's of kilobyles to several gigabytes depending on the method chosen to represent the data tables.
Add more
Template
Application 1
- Brief description of app and its business importance
- Brief description of app domain
- Qualitative or quantitative analysis of where and how it would benefit from HHAT
- Expected time table for delivery of a solution (e.g. readiness for the arrival of a new supercomputer at a USG lab), and resources available to implement it with HHAT
- purpose: identify apps that could lead vehicles that drive the development of an open source project and that would be a poster child that would build confidence for others to follow