This wiki page keeps minutes from phone and face-to-face meetings on the topic of usage models, user stories, and applications for heterogeneous hierarchical asynchronous tasking. Most recent meetings are listed on top. See [Presentations] for the ongoing agenda for monthly meetings and the materials that were posted.

BoF, Active Messages, POC Update, July 18, 2017

Attendees (41): Wilf, Bala Seshasayee (Intel), Kath Knobe, Marcin Zalewski, Max Grossman, Rob Vander Wijngaart, Wael Elwasif (ORNL), Andrew Lumsdaine, BillF, Hartmut Kaiser, Jesun (PNNL), John Feo (PNNL), John Stone, Michael Garland, Michael Wong, Millad Ghane, minu455, Oscar Hernandez, Patrick Atkinson, Ron Brightwell, Siegfried Benkner (U Vienna), Sunita Chandrasekaran, Szilard Pall, Vincent Cave, Ashwin Aji (AMD), John Biddiscombe (CSCS), Naoya Maruyama, Kamil Halbiniak, Roman Wyrzykowski, CJ Newburn, Mike Bauer

BoF - BillF
- SC17 BoFs due 2 weeks from today
- 60min mid day or 90min evening

Active messages - Marcin Zalewski
- Overview, benefits
- Active Pebbles/AM++ and GMT currently on MPI, tiny messages, latency/throughput trade-offs
- Could be layered on PGAS - put with completion
- Would like to look into integration with HiHAT
  - Layering above HiHAT, mapping to transports under HiHAT
  - Triggering upon receipt of transferred data
  - Consider whether there are ordering requirements, e.g. fence/quiet
- RonB: biggest differentiator of active messages is on receiver side
- HartmutK: integration with scheduler is interesting part
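
As a rough illustration of the receiver-side behavior noted above, the sketch below models an active message as a handler id plus a payload, with the named handler run when the message is received. Everything here (`Endpoint`, `register`, `send`, `progress`) is a hypothetical in-process stand-in, not the AM++, GMT, or HiHAT API; a real transport would move bytes between nodes rather than append to a local queue.

```python
from collections import deque

class Endpoint:
    """Hypothetical receiver: holds a handler table and an inbox."""
    def __init__(self):
        self.handlers = {}      # handler id -> callable
        self.inbox = deque()    # stands in for the network transport

    def register(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def send(self, handler_id, payload):
        # The sender names the handler; delivery here is just an enqueue.
        self.inbox.append((handler_id, payload))

    def progress(self):
        # Receiver side: run the named handler for each arrived message.
        handled = 0
        while self.inbox:
            handler_id, payload = self.inbox.popleft()
            self.handlers[handler_id](payload)
            handled += 1
        return handled

counts = {}

def increment(key):
    # Tiny handler, in the spirit of the "tiny messages" mentioned above.
    counts[key] = counts.get(key, 0) + 1

node = Endpoint()
node.register("incr", increment)
for key in ["a", "b", "a"]:
    node.send("incr", key)
handled = node.progress()
```

In a scheduler-integrated design, `progress` might hand each handler to a task scheduler instead of running it inline; that choice is an assumption of this sketch, not something decided in the meeting.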

Proof of Concept Update - Millad Ghane (U Houston), John Stone (UIUC)
- Preliminary implementation to demonstrate feasibility
- Implemented basics: data movement, data management, invocation
- Current target platform: 2 CPU sockets, 2 GPUs
- HiHAT overheads relative to CUDA for microbenchmarks: lesser of 2% or 500ns
- User/Common Layer overhead for table look ups: 60ns
- Molecular Orbitals app running on top
  - 3% HiHAT API overhead
  - 1.2x reduction in static APIs and unique APIs
  - Porting time to HiHAT was 90 minutes

More users, proof of concept plans, high-level design doc, June 20, 2017

Attendees (43): Antonio Tumeo, Ashwin Aji, BillF, David Bernholdt, DebalinaB, DGenet, Dmitry Liakh, Firo017, Gordon Brown, Hans Johansen, James Beyer, Jesun (PNNL), Jiri Dokulil, John Stone, Kamil Halbiniak and Roman, Kath Knobe, Keeran Brabazon (ARM), Mauro Bianco, Michael Garland, Mike Bauer, Mike Chu, Millad Ghane, Minu455, Naoya Maruyama, Piotr, Rob Neely, Ronak Buch, Ruyman Reyes, Samuel Thibault, SharanA (NVIDIA Tegra), Siegfried Benkner (U Vienna/StarPU), Stephen Olivier, Szilard Pall, Thomas Herault, Tim Blattner, Vincent Cave, Wael Elwasif (ORNL), CJ, ...

Some new participants: NIST/HTGS, UINTAH, StarPU, more from NVIDIA, e.g. automotive
- Tim Blattner presented slides (see Presentations)
Proof of concept
- Review of POC plan doc (see Presentations)
- John Stone, VMD and molecular orbitals
- Some discussion of the benefits of dynamic scheduling
- There's a value to progressive back off on the dynamism of scheduling, potentially based on profile-driven need - John, Szilard, Wilf
High-level design doc (see Presentations)

Mini-Summit Synthesis, May 16, 2017

Attendees

Carter, David Bernholdt, George, Max, Michael Garland, Michael Robson (PPL/UIUC); Millad, Patrick, Piotr, Thomas Herault, Dmitry, Szilard, Toby, Wael, Damien, Andrew, Ashwin, Jiri Dokulil, Naoya Maruyama, Oscar, Pietro Cicotti, Wilf, CJ

Welcome, intro
DHPC++ review
- Compare/contrast with OpenCL, OpenVX, Vulkan
Mini-Summit review
- Who gathered
- Slides should be integrated, some updates
- Overview
- Tabulation of results
- Review of poll/ratification
  - This broader audience also ratified what was listed
  - How do you connect different MPI worlds?
  - Clarify that HiHAT has to stage data across sub-clusters
  - Clarify granularity of work
- Sampling of requirements
  - Active messages (PNNL, Andrew)
  - Futures with data (OCR, Vincent; HPX, Hartmut Kaiser)
  - Callbacks on completion (OCR, Vincent)
  - Dynamic compilation (R-Stream, KART, LLVM)
  - Graph reuse (SWIFT/QuickSched, Stephen) - later
  - Partial I/O (SWIFT/QuickSched)
  - Feedback for auto-tuning (TensorRT)
  - Reproducibility via control
- Additional key issues to debate
Who else should be drawn in
- OpenVX
- Vulkan
- StarPU
- UINTAH
Topics for the future
- Portability, content of tasks - Carter
- Task scheduling for accelerators, SMP - Szilard
- Interoperation, remerging with other efforts, e.g. OpenCL, OpenMP - Szilard
- Performance analysis and monitoring APIs - Oscar
- Defining terms, e.g. future

Partner interactions, NVIDIA usage models, AMD efforts, Apr 18, 2017

Attendees (37) included: Wilf, Kath Knobe, Millad Ghane, CJ Newburn, Bill Feiereisen, Dmitry Liakh, Gordon Brown, Jesmin Tithi, Hans Johansen, Jiri Dokulil, John Feo, Kelly Livingston, Max Grossman, Mauro Bianco, Oscar Hernandez, Patrick Atkinson, Piotr Luszczek, Ron Brightwell, Ruyman Reyes, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Szilard Pall, Wael Elwasif, Ashwin Aji, George Bosilca, John Stone, Benoit Meister, Andrew Lumsdaine, Mike Bauer, Tim

CJ offered some recent highlights of partner interactions

ARM, AMD, IBM, NVIDIA engaging
User story, requirements and app updates
- Jim Phillips, NAMD; John Stone, VMD; Ronak Buch on Charm++, David Richards on transport Monte Carlo
- David Keyes, on categories of hierarchical algorithms
DHPC++ workshop, Toronto, May 16; will be a talk on HiHAT https://easychair.org/cfp/dhpcc17
Performance portability workshop, week of Aug 21, is expected to have some coverage of HiHAT

Upcoming HiHAT Mini-Summit

See info here

Teaser on NVIDIA usage models, Stephen Jones, NVIDIA

NVIDIA interested in HiHAT to broaden access of codes to resources in hetero platforms
Also for AI: deep learning and inference have available task-based parallelism
- Offered some background on DNN, RNN
As the lower bound on task granularity drops, more task parallelism may be available
Two ways to leverage fine-grained tasks better:
- reduce overheads for actions like invocation and moving data, instigated by CPU and performed on GPU --> lower-overhead Common Layer
- aggregate tasks in sequences and sub-graphs, that are passed down to target for localized handling --> richer tasking abstractions
Common requirements induced by inference and deep learning for HiHAT

Teaser on relevant AMD efforts, Ashwin Aji, AMD

ROCm = Radeon Open Compute, rebranding of HSA
- Similar to common layer, thin API that abstracts underlying compute and memory HW
- Task descriptors, lock-free data structures, door bells that trigger task execution
ATMI = Async Tasking and Memory Interface
- Kinds of tasks, low-latency signaling among tasks
Links
- ATMI info: http://gpuopen.com/compute-product/atmi/
- ATMI github: https://github.com/RadeonOpenCompute/atmi
- ROCm platform info: http://gpuopen.com/compute-product/rocm/
- ROCR (Runtime) API: https://github.com/RadeonOpenCompute/ROCR-Runtime

HiHAT overview, PaRSEC, Mar 21, 2017

Attendees included Wilf Pinfold, Benoit Meister, Patrick Atkinson, Schumann, George Bosilca, Piotr Luszczek, Jim Phillips, Stephen Olivier, Max Grossman, Bill Feiereisen, Dmitry Liakh, Wael Elwasif, Jiri Dokulil, Gordon Brown, John Stone, Andrew Lumsdaine, Thomas Herault, Ronak Buch, Ashwin Aji, Bala Seshasayee, Michael Garland, Damien Genet, Aurelien Bouteiller, Oscar Hernandez, PSZ - Paul Szillard?, Timo (Blue Brain), Kelly Livingston, Antonio Tumeo, CJ, several more

CJ gave a HiHAT Overview

Progress in funding, e.g. from US government and vendors
Several posts to web, including from PASC, Charm++, VMD, Habanero tasking micro-benchmark suite
Upcoming report out on progress at GTC, morning of May 9 in San Jose
- Usage models and requirements
- Reveal initial progress on prioritized HiHAT interface design
Highlighted SW architecture of HiHAT, especially regarding pluggable modules, user layer with target-specific decision making with ease of use, and common layer that dispatches to target-specific implementations of actions
Call for more participation in identifying prioritized functionality of HiHAT to leverage, specific requirements and interfaces
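
The layering described in the overview can be pictured with a small sketch: a user layer makes the target-specific decision, and a common layer dispatches the action to whichever target-specific implementation has been plugged in. All names below (`CommonLayer`, `plug_in`, `dispatch`, `user_copy`, the `"cpu"`/`"gpu"` target labels) are hypothetical and chosen for illustration; they are not the HiHAT interface.

```python
class CommonLayer:
    """Hypothetical dispatch table: (action, target) -> implementation."""
    def __init__(self):
        self._impls = {}

    def plug_in(self, action, target, fn):
        # A pluggable module registers its implementation of an action.
        self._impls[(action, target)] = fn

    def dispatch(self, action, target, *args):
        try:
            impl = self._impls[(action, target)]
        except KeyError:
            raise NotImplementedError(
                f"no {action!r} for target {target!r}") from None
        return impl(*args)

def cpu_copy(dst, src):
    dst[:] = src
    return "cpu"

def gpu_copy(dst, src):
    # Placeholder: a real module would call a vendor copy routine here.
    dst[:] = src
    return "gpu"

layer = CommonLayer()
layer.plug_in("copy", "cpu", cpu_copy)
layer.plug_in("copy", "gpu", gpu_copy)

def user_copy(dst, src, prefer_gpu):
    # User layer: make the target-specific decision, then dispatch.
    target = "gpu" if prefer_gpu else "cpu"
    return layer.dispatch("copy", target, dst, src)

buf = [0, 0, 0]
used = user_copy(buf, [1, 2, 3], prefer_gpu=True)
```

Keeping the decision (`user_copy`) separate from the table lookup (`dispatch`) is the point of the sketch: the lookup stays cheap and uniform, while policy can differ per target.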

George Bosilca of U Tennessee gave an overview of PaRSEC interaction with HiHAT

Data-centric programming environment based on async tasks executing on a hetero distributed environment
Offers a domain-specific language interface
Delivers good performance and scalability
SW architecture is based on modular component architecture of Open MPI, so it's quite amenable to plugging in HiHAT implementations for some of its functionality.
Prioritized wish list
- Portable and efficient API/library for accelerator support - data movement, tasks
- Portable, efficient and inter-operable communication library (UCX, libFabric, …)
  - Moving away from MPI will require an efficient datatype engine
  - Also supported by rest of the software stack (for interoperability)
- Resource management/allocation system
  - PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes
- Memory allocator: thread safe, runtime defined properties, arenas (with and without sbrk). (memkind?)
- Generic profiling system, tools integration
- Task-based debugger and performance analysis

Items for potential discussion and investigation

Enumeration - look at interaction with HWLOC
Dealing with unstructured data and data types
Data versioning
Serialized streams and subsequences of actions; may want cancellation
Resilience - detection, propagation
Interfaces for data movement, how that relates to MPI, collectives

OCR Review, Feb. 21, 2017

Wilf: Presentation material out on the wiki: OCR usage models is the one for today
Bala - OCR (Open Community Runtime), presents overview of OCR

Wilf: How do you decide on granularity of the task breakdown for AutoOCR? Is there some sort of input file?
Bala: Granularity is entirely the choice of the developer. AutoOCR is pretty straightforward - use a keyword to indicate that a task should be an EDT and annotate data blocks. Compiler will follow that and decorate with OCR API. It makes no decisions regarding granularity for itself. Compiler path is implemented in LLVM which looks at the keywords and generates OCR code.

Wilf: With MPI-Lite can you get some resiliency that you can't get from MPI?
Bala: That's interesting; we've not tried it. Resiliency & MPI-Lite have each been tried in isolation but not together.
Stephen Jones: How do people usually port to OCR?
Bala: People usually try to see if their MPI code can adapt to OCR. Will sacrifice performance while they see if they can implement in OCR. Some constructs like MPI_Wait are not aligned with OCR (which assumes an EDT can run to completion). Once people have adapted to OCR then there's no more reason to run MPI at all - they'll then restructure their program to reduce bottlenecks once they have a much better view of the dataflow graph.
CJ: What about continuation-style semantics?
Bala: A constant back-and-forth: should we stick to the "pure" model of no waits or stalls once a task has started? This would mean we need to split the task around a stall, but would also make data management complex between tasks. Some have looked at continuation semantics as a way to wait & context-switch within a task: moves the complexity into the runtime, which has to implement the continuation. Not many people have been trying this yet.
CJ: That's what Argobots & Qthreads are going after. HiHAT is looking to layer these on top of it to manage such continuations.
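
To make the continuation idea in this exchange concrete, here is a minimal sketch that uses a Python generator as a stand-in for a task that suspends at a wait and is resumed by the runtime. The `run` function, the event name `"data_ready"`, and the task shape are all assumptions for illustration; this is not how OCR, Argobots, or Qthreads implement continuations.

```python
def task(log):
    log.append("start")
    # Suspend here; the runtime resumes the task with the event's value.
    value = yield "data_ready"
    log.append(f"resumed with {value}")

def run(tasks, events):
    """Hypothetical runtime: park each task on the event it names, then resume."""
    waiting = {}
    for gen in tasks:
        event = next(gen)           # run the task up to its first wait
        waiting.setdefault(event, []).append(gen)
    for event, value in events:
        for gen in waiting.pop(event, []):
            try:
                gen.send(value)     # resume the task after the wait
            except StopIteration:
                pass                # the task ran to completion

log = []
run([task(log)], [("data_ready", 42)])
```

Under the "pure" run-to-completion model, the same work would instead be written as two separate tasks with the intermediate state passed between them explicitly, which is the data-management complexity Bala describes.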
Bala presents on app requirements support
Wilf: What's performance looking like right now for e.g. MPI-Lite? How heavy is the task-based overhead at this time?
Bala: For MPI-Lite we've not put any effort into performance, because it's not trying to compete with MPI. OCR uses MPI for communication in this mode. Numbers look promising. At 16k cores OCR does not appear to perform any worse than MPI.
Wilf: How does resiliency play into this, if you've got 16k cores for example?
Bala: Not tried it at that scale yet. It will obviously slow things down. Has been tried out in isolation but not mixed together with performance yet.
Wilf: What about load-balancing? Was that 16k run fairly regular?
Bala: Again, have not yet tried this out in an application. In isolation, have used it at 64-node scale.
- Have tried it out with Mini-AMR and seen some good results but still wrestling with heuristics that are needed. More heuristic intelligence does not seem to provide a lot of benefit because of the overhead of coming up with intelligent heuristics.
Stephen Olivier: Do you have any full-sized apps you have results for?
Michael Wong: Do you have a regular OCR call?
MW: Have you looked at any bottlenecks inside OCR?
Bala: One of the things we're already aware of is the GUID implementation. Making it globally unique can be expensive and in practice you don't always need it to be truly global around the cluster: you only need uniqueness spatially or temporally. Suggests two types of GUID: truly global, and then more local UID.
- Can also probably shave off some overhead in event management (Legion has managed this, for example). You can often re-use events without the overhead of creation/destruction.
Wilf: Here's where we are with the meetings
- We've been using EventBrite for registration but it's getting a bit awkward. Trying to move over to MailChimp. We've got about 69 in the group (30 on the call today).
- Everyone will receive an email in the next week for registration. Use that to register, not EventBrite, in future please.
- Wiki will be kept with link to database of MailChimp info
CJ: Some higher-level comments & context
- Upcoming talks will look at the apps/algos which will be layered on top of HiHAT.
- Lots of good work in progress - appreciate people contributing and sharing
Michael Wong: One thing he's looking at is developing heterogeneous C++. If the group is interested he can send out some information about that. Also going to be running a workshop on ISO C/C++ and other high level heterogeneous C++ programming models here.
CJ: Want to look at these things and decide "would these be called BY HiHAT, or built on top of HiHAT?"
MW: Do have models which can build on top of HiHAT. Can have discussion at a later meeting.

Community meeting, Jan 17, 2017

Agenda

Welcome: Wilf Pinfold
Overview, purpose
Solicit apps that need hierarchical tasking
Solicit usage models
- Fully dynamic to semi-static - Pall
Solicit user stories (requirements)
- Map tasks to multiple GPUs - Dmitry
- Granularity - Pall
- Finite memory - Carter; see "Sandia" on Applications page
- Distributed data structures in finite memory - Toby
- For latency-sensitive apps, any overheads need to be offset by significant gains - Pall
- Hierarchical topology - Toby
- Building libs for finite physical memory; libs cooperating with caller, e.g. via callbacks - John Stone
- Aggregated task groups, recursive task model that enables decomposition - Dmitry, Ashwin Aji
- Data affinity-driven binding and scheduling and data decomposition - Pall
- Move work to data vs. other way around - John
- PGAS support, data affinity and decomposition - Toby
Housekeeping - Wilf

Participants included: Wilfred Pinfold - creator, John Stone, umit@gatech.edu, Wael Elwasif, xg@purdue.edu Xinchen Guo, belak1@llnl.gov, Ruymán Reyes, pa13269@bristol.ac.uk Patrick Atkinson, Max Grossman, gordon@codeplay.com, bala.seshasayee@intel.com, mbianco@cscs.ch, ashwin.aji@amd.com, khalbiniak@icis.pcz.pl - Kamil Halbiniak, roman@icis.pcz.pl - Roman Wyrzykowski, fabien.delalondre@epfl.ch, richards12@llnl.gov, pszilard@kth.se - Pall, Michael Wong, Shekhar Borkar, David Bernholdt, rabuch2@illinois.edu, bill@feiereisen.net, cnewburn@nvidia.com, Piotr Luszczek, liakhdi@ornl.gov, Muthu Baskaran, jesmin.jahan.tithi@intel.com, slolivi@sandia.gov, hcedwar@sandia.gov - Carter, fuchst@nm.ifi.lmu.de - Toby, rbbrigh@sandia.gov - Ron

Signed up, but seemed not to make it: timothy.g.mattson@intel.com, schulzm@llnl.gov, oscar@ornl.gov[conflict], mbauer@nvidia.com, romain.e.cledat@intel.com, aiken@cs.stanford.edu, mfarooqi14@ku.edu.tr, lopezmg@ornl.gov, Benoit Meister, vgrover@nvidia.com, kelly.a.livingston@intel.com, alexandr.nigay@inf.ethz.ch, matthieu.schaller@durham.ac.uk, manjugv@ornl.gov, esaule@uncc.edu, schandra@udel.edu, cychan@lbl.gov, gshipman@lanl.gov, mgarland@nvidia.com, vsarkar@me.com, Didem Unat, maria.garzaran@intel.com, john.feo@pnnl.gov, mike.chu@amd.com, timothee.ewart@epfl.ch, jim@ks.uiuc.edu, n-maruyama@acm.org, pcicotti@sdsc.edu, kk13@rice.edu, srajama@sandia.gov

Kickoff, Dec. 20, 2016

Agenda

Welcome: Wilf Pinfold
Overview, purpose
Approach
Wiki explanation
Next steps
Feedback, expression of interest

Participants (33) included

BillF, CarterE, DavidR, Erik, JimP, KamilH & RomanW, PatrickA, PietroC, SenT, ShekharB, CJ, WilfP, StephenJ, XinchenG, TimM, RomainC, OscarH, AlexandrN, VinodG, KathK, Ashwin Aji, JoshF, GalenS, ManjuG, PallS, MariaG, ...
See calendar entry, if you signed up

Discussion

Glossary suggested by Tim, try not to invent new definitions
Report suggested by Oscar - summary of usage cases could be useful for DoE
How do we keep from getting fragmented? (Tim) Try to bring the community together by focusing on common requirements (Wilf)
Start with usage models, requirements, provisioning constraints, rather than comparing and contrasting specific implementations
We have data and experience to share
Looking to have a phone meeting 3rd Tue each month at 9am PST; some here had standing conflicts; Wilf to try a Doodle poll
Time scale, involvement, outputs?
Are we sold on async tasking? Driven more by efficiency on HW? (Shekhar) Yes (Oscar) Who needs it for what? We need compelling examples of where mainline DoE apps need it. (Dave Richards) Clever use of MPI goes a long way (Tim)
MPI: resilience not well addressed (Wilf) Comparison with MPI is inappropriate, tasking can be done on top of MPI, e.g. two-hot, accelerated MD. It's about the benefit of a computational model, which helps some and not others. (Galen) Tim agrees that MPI is low-level runtime.
Interesting to identify a set of apps that embody tasking, and understand why they chose that model (Galen) Sounds like a potential value proposition (Shekhar).
Characteristics: granularity of tasks - the finer the granularity the less portable the solution, explicit vs. implicit control (DaveR) If task relationships can be described, it can become more portable (Stephen) How will decomposition happen - expert, compiler, runtime? (DaveR)
How do we make this applicable to large, portable code bases, enabling productivity? Where does the tasking model emerge? (DaveR)
What does it mean to have an async environment, what are the critical features? (Josh)
The way to resolving differences at various levels may lie in hierarchy (Kath) Strongly agree with hierarchy (Tim)
Strongly agree with a bottom up approach, with a hierarchical perspective (Tim)

HHAT Usage Meeting Minutes

From Modelado Foundation