Performance Tools

Questions:

What abstractions does your runtime stack use for parallelism?
What performance information would application developers need to know to tune codes that use your X-Stack project's software?
What would a systems software developer need to know to tune the performance of your software stack?
What information should a performance tool gather from each level in your software stack?
What performance information can/does each level of your software stack maintain for inspection by performance tools?
What information would your software stack need to maintain in order to measure per-thread or per-task performance? Can this information be accessed safely from a signal handler? Could a performance tool register its own tasks to monitor the performance of the runtime?
What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency?
What kinds of performance problems do you foresee analyzing using post-mortem analysis?
What kinds of performance problems do you foresee analyzing using runtime analysis? What interfaces will be needed to gather the necessary information?
What control interfaces will be necessary to enable runtime adaptation based on runtime performance measurements?
There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?
What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?
List the performance challenges that you think next generation programming languages and models will face.

What abstractions does your runtime stack use for parallelism?
XPRESS	The XPRESS project identifies three levels of parallelism associated with the ParalleX execution model: (a) coarse grain: the ParalleX process, which provides context for child processes and other forms of computation and may span multiple localities (i.e., nodes); (b) medium grain: the ParalleX compute complex (e.g., a thread instantiation), which runs on a single locality and is a partially ordered set of operations; and (c) fine grain: the operations coordinated by a static dataflow graph (DAG).
TG	Event-driven tasks (EDTs) and data blocks.
DEGAS
D-TEC
DynAX	The basic unit of computation is the SWARM codelet. Codelets are grouped into SCALE procedures, which allow data to be shared across codelets. On top of codelets and procedures, two higher-level notations have been implemented: the Hierarchically Tiled Array (HTA) object, which includes data-parallel operations and is implemented as a library, and sequential C code that is auto-parallelized and translated by R-Stream into codelets.
X-TUNE
GVR
CORVETTE
SLEEC

What performance information would application developers need to know to tune codes that use your X-Stack project's software?
XPRESS	The critical questions concern task granularity, given the overhead costs of managing threads, and the relative locality of execution and data objects, although both can be addressed in part by compiler and runtime functions.
TG	With separation of concerns as the goal, the application developer does not need to know any platform-specific information. Developers need to provide hints that describe the software, using the appropriate runtime APIs, which the runtime uses to aid resource management.
DEGAS
D-TEC
DynAX	Cost of memory accesses across the different levels of the hierarchy, overhead associated with codelet initiation and coordination, scheduling strategies.
X-TUNE
GVR
CORVETTE
SLEEC

What would a systems software developer need to know to tune the performance of your software stack?
XPRESS	The critical questions concern task granularity, given the overhead costs of managing threads, and the relative locality of execution and data objects, although both can be addressed in part by compiler and runtime functions.
TG	The runtime exposes resource management modules (introspection, allocator, scheduler) using well-defined internal interfaces that can be replaced or tweaked by the systems software developer to target the underlying platform.
DEGAS
D-TEC
DynAX	Cost of communication across different levels of the hierarchy, cost of context switching, size of different memory levels, classes of processors, parameters of the scheduling strategy.
X-TUNE
GVR
CORVETTE
SLEEC

What information should a performance tool gather from each level in your software stack?
XPRESS	Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
TG	At the application layer: application profiling. At the runtime layer: resource management decisions and runtime overheads. At the simulation layer: detailed resource usage, including monitoring exposed by hardware.
DEGAS
D-TEC
DynAX	Within the codelet: (a) sequential performance of each codelet (e.g., in gigaflops), and (b) cost associated with memory accesses by instructions in a codelet (e.g., how many accesses are cache hits, how many go to scratch pad memory, how many go to remote locations). Across codelets: (a) overhead caused by both local and remote codelet initiation, cost of argument boxing, and data communication costs (moving data), (b) processor utilization and the impact of the scheduling choices of the runtime stack, and (c) other overhead of the runtime system.
X-TUNE
GVR
CORVETTE
SLEEC

What performance information can/does each level of your software stack maintain for inspection by performance tools?
XPRESS	Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
TG	Please see above.
DEGAS
D-TEC
DynAX	Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
X-TUNE
GVR
CORVETTE
SLEEC
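
The per-codelet quantities DynAX lists above could be held in a small record that a performance tool reads after each codelet terminates. The following is only a sketch: `CodeletRecord`, `summarize`, and every field name are hypothetical and are not part of SWARM or any other X-Stack runtime; the cache miss rate is assumed to be derived from raw access and miss counts.

```python
from dataclasses import dataclass


@dataclass
class CodeletRecord:
    """Hypothetical per-codelet counters; names are illustrative only."""
    initiation_time: float    # seconds, when the codelet was initiated
    termination_time: float   # seconds, when the codelet terminated
    cache_accesses: int       # cache accesses observed during the codelet
    cache_misses: int         # accesses that missed in the cache
    codelets_initiated: int   # codelets this codelet initiated
    parameter_bytes: int      # total size of parameters, in bytes
    reinitiations: int = 0    # restarts caused by failures

    @property
    def duration(self) -> float:
        return self.termination_time - self.initiation_time

    @property
    def cache_miss_rate(self) -> float:
        if self.cache_accesses == 0:
            return 0.0
        return self.cache_misses / self.cache_accesses


def summarize(records):
    """Aggregate records into totals a tool might report."""
    accesses = sum(r.cache_accesses for r in records)
    misses = sum(r.cache_misses for r in records)
    return {
        "codelets": len(records),
        "total_time": sum(r.duration for r in records),
        "miss_rate": misses / accesses if accesses else 0.0,
        "reinitiations": sum(r.reinitiations for r in records),
    }
```

Keeping raw counts rather than a precomputed rate lets a tool aggregate miss rates across codelets without weighting errors.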

What information would your software stack need to maintain in order to measure per-thread or per-task performance? Can this information be accessed safely from a signal handler? Could a performance tool register its own tasks to monitor the performance of the runtime?
XPRESS	The APEX component of the HPX runtime system maintains the necessary information to measure per-thread performance including start, stop, suspend/pending event times, and ops counts. A performance tool can register its own tasks to monitor the performance of the runtime. Threads are first class objects and can be directly accessed by other threads.
TG	Currently, the runtime maintains this information at varying degrees of granularity, depending on the developer's choice (from instruction and byte counts all the way to task statistics), and the information is available for offline analysis. Future work will allow a portion of this analysis to be performed online, so that custom performance tool tasks can be accommodated.
DEGAS
D-TEC
DynAX	Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
X-TUNE
GVR
CORVETTE
SLEEC
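
As an illustration of how per-thread event times such as start, suspend, and stop can be turned into a per-task measurement, the sketch below accumulates the time a task actually spent running. It is not the APEX or HPX interface: the event format, the `resume` event kind, and the function name `running_time` are assumptions made for the example.

```python
def running_time(events):
    """Return the time a task spent running.

    `events` is an iterable of (timestamp, kind) pairs, where kind is
    "start", "suspend", "resume", or "stop" (a hypothetical encoding).
    Suspended intervals are excluded from the total.
    """
    total = 0.0
    running_since = None
    for timestamp, kind in sorted(events):
        if kind in ("start", "resume"):
            running_since = timestamp
        elif kind in ("suspend", "stop"):
            if running_since is not None:
                total += timestamp - running_since
                running_since = None
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return total


# A task that runs for 2.0 s, is suspended for 3.0 s, then runs 1.5 s more.
events = [(0.0, "start"), (2.0, "suspend"), (5.0, "resume"), (6.5, "stop")]
active = running_time(events)
```

The gap between wall-clock lifetime (6.5 s here) and active time is the kind of per-task idleness a tool could attribute to pending dependences or scheduling delay.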

What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency?
XPRESS	All of the above and more.
TG	All of the above, with the exception of I/O. In addition: runtime overheads at module-level granularity, memory use at different levels of the hierarchy, temperature and reaction time, and DVFS and its effects.
DEGAS
D-TEC
DynAX	Yes. All of the above.
X-TUNE
GVR
CORVETTE
SLEEC

What kinds of performance problems do you foresee analyzing using post-mortem analysis?
XPRESS	Post-mortem information would be useful for analyzing non-causal behavioral data that cannot be predicted prior to execution. The analysis must also distinguish behavior that is entirely data dependent, and therefore likely to change, from behavior that is an intrinsic property of the program. A determination of the critical path and side-path tasks, combined with the energy and time consumption of each task, would be very useful.
TG	The primary problems diagnosed this way will be resource management decisions and whether hints supplied by the program or compiler are correctly incorporated into decision making. Runtime overheads will also be tracked closely.
DEGAS
D-TEC
DynAX	Finding a balance between performance and power consumption; assessing the resiliency of the system and the impact of faults on performance.
X-TUNE
GVR
CORVETTE
SLEEC
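
The critical-path determination that XPRESS mentions above can be sketched as a longest-path computation over a recorded task graph. The example is a minimal illustration, assuming a post-mortem trace that provides a measured duration per task and the list of predecessors each task waited on; the function name and input format are hypothetical. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (path, length) of the longest chain of dependent tasks.

    durations: dict mapping task -> measured execution time
    deps:      dict mapping task -> list of predecessor tasks
    """
    graph = {task: deps.get(task, ()) for task in durations}
    finish = {}      # earliest finish time of each task
    best_pred = {}   # predecessor that determines that finish time
    for task in TopologicalSorter(graph).static_order():
        preds = deps.get(task, ())
        prev = max(preds, key=lambda p: finish[p], default=None)
        best_pred[task] = prev
        start = finish[prev] if prev is not None else 0.0
        finish[task] = start + durations[task]
    # Walk back from the task that finishes last.
    end = max(finish, key=finish.get)
    length = finish[end]
    path = []
    while end is not None:
        path.append(end)
        end = best_pred[end]
    path.reverse()
    return path, length


durations = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 4.0}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
path, length = critical_path(durations, deps)
```

Tasks that do not appear on the returned path (here `c`) are side-path tasks; the same traversal could carry per-task energy alongside time if the trace records it.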

What kinds of performance problems do you foresee analyzing using runtime analysis? What interfaces will be needed to gather the necessary information?
XPRESS	The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid jamming the system, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
TG	DVFS decisions made by the runtime, and their impacts, will be analyzed using runtime analysis.
DEGAS
D-TEC
DynAX	To identify hardware, system, and runtime bottlenecks, and to identify poor prioritization of the application's critical path. It is unclear what interfaces would provide sufficient introspection into the necessary systems without incurring an unacceptable performance penalty; further discussion on this topic is welcome.
X-TUNE
GVR
CORVETTE
SLEEC

What control interfaces will be necessary to enable runtime adaptation based on runtime performance measurements?
XPRESS	The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid jamming the system, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
TG	This is ongoing work, with scalability being the emphasis (since the metrics will produce huge volumes of data that will be hard to manage). Currently, statistical properties of metrics are being considered for use as proxies for various underlying causes. The interfaces are predominantly those exposed by hardware to the runtime (via counters), plus minimal interfaces provided to resource management modules by the runtime.
DEGAS
D-TEC
DynAX	Execution time, energy consumption, cache miss ratio, fraction of accesses to each class of remote memory, latency of memory accesses, network collisions.
X-TUNE
GVR
CORVETTE
SLEEC

There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?
XPRESS	For purposes of performance portability, the principal information required from the programmer level is parallelism and some relative locality information. It is possible that some higher-level idiomatic patterns of control and access may be useful, but these have yet to be determined.
TG	The runtime provides implementation-level performance data at the runtime API level. Source transformations from the high-level application to the runtime API would also need to provide mechanisms to reverse-map the runtime-provided, implementation-level information back to the high-level application. Currently, implementation-level details can still be mapped back to the application design with a basic level of familiarity with the transformation tools.
DEGAS
D-TEC
DynAX	In the case of HTA, the cost of each HTA operation decomposed into computation (sequential and parallel), communication cost, locality (cache misses or equivalent if scratch pad memories are used), network congestion, energy consumption.
X-TUNE
GVR
CORVETTE
SLEEC

What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?
XPRESS	Visualization of resource usage and pending (bottlenecked) work will help to inform about intrinsic code parallelism and precedence constraints.
TG	Some high level transformation tools already have a graphical representation of the program abstractions. Additionally, we also have a graphical representation of data movement and energy consumption at the simulator level. These will be enhanced to accommodate other performance metrics currently being tracked. IDE integration has not been a focus so far, but will be considered once the toolchain attains maturity.
DEGAS
D-TEC
DynAX	A system like Intel's Parallel Studio and Thread Profiler.
X-TUNE
GVR
CORVETTE
SLEEC

List the performance challenges that you think next generation programming languages and models will face.
XPRESS	Overhead and its impact on granularity, diversity of forms and scales of parallelism, parallelism discovery from meta data, energy suppression.
TG	Based on our choice of the EDT (Event-Driven Task) model, a primary challenge will be to ensure that resource management overheads do not nullify the gains from the extra parallelism the model enables. We plan to address this by settling on the right granularity of task length and data block size, so that overheads are kept low and the right balance between parallelism and management overhead is struck.
DEGAS
D-TEC
DynAX	Tuning for a complex target environment where power, small memory sizes, and redundancy for reliability are issues. Exposing the important factors for parallelism, locality, communication, power, and reliability in a machine independent manner.
X-TUNE
GVR
CORVETTE
SLEEC
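
The trade-off between task granularity and management overhead raised by XPRESS and TG above can be made concrete with a toy cost model. This is a deliberately simplified sketch, not a model used by any of the projects: it assumes equal-sized tasks, a fixed overhead per task, identical workers, and no communication cost, and all function and parameter names are hypothetical.

```python
import math


def estimated_time(total_work, task_size, overhead_per_task, workers):
    """Estimate the makespan when work is split into equal-sized tasks.

    Tasks execute in rounds of `workers` tasks at a time; each task
    costs its own work plus a fixed management overhead. A final
    partial task is charged as a full one, so this is an upper bound.
    """
    tasks = math.ceil(total_work / task_size)
    rounds = math.ceil(tasks / workers)
    return rounds * (task_size + overhead_per_task)


def best_task_size(total_work, candidates, overhead_per_task, workers):
    """Pick the candidate task size with the lowest estimated time."""
    return min(
        candidates,
        key=lambda size: estimated_time(
            total_work, size, overhead_per_task, workers
        ),
    )


# 1000 units of work, 5 units of overhead per task, 10 workers.
sizes = [1, 10, 100, 1000]
times = {s: estimated_time(1000, s, 5, 10) for s in sizes}
best = best_task_size(1000, sizes, 5, 10)
```

Under these assumptions, very fine tasks are dominated by overhead and a single coarse task leaves nine workers idle, so the minimum sits in between. Real runtimes would need measured overheads and non-uniform task costs.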

Note: PIPER is not listed as a column above because it is intended as a recipient of this information.

From Modelado Foundation