
Performance Tools: Difference between revisions

From Modelado Foundation

'''Questions:'''


{| class="wikitable"
* [[#table1|What abstractions does your runtime stack use for parallelism?]]
! style="width: 200;" | QUESTIONS
* [[#table2|What performance information would application developers need to know to tune codes that use your X-Stack project's software?]]
! style="width: 200;" | XPRESS
* [[#table3|What would a systems software developer need to know to tune the performance of your software stack?]]
! style="width: 200;" | TG X-Stack
* [[#table4|What information should a performance tool gather from each level in your software stack?]]
! style="width: 200;" | DEGAS
* [[#table5|What performance information can/does each level of your software stack maintain for inspection by performance tools?]]
! style="width: 200;" | D-TEC
* [[#table6|What information would your software stack need to maintain in order to measure per-thread or per-task performance? Can this information be accessed safely from a signal handler? Could a performance tool register its own tasks to monitor the performance of the runtime?]]
! style="width: 200;" | DynAX
* [[#table7|What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency?]]
! style="width: 200;" | X-TUNE
* [[#table8|What kinds of performance problems do you foresee analyzing using post-mortem analysis?]]
! style="width: 200;" | GVR
* [[#table9|What kinds of performance problems do you foresee analyzing using runtime analysis? What interfaces will be needed to gather the necessary information?]]
! style="width: 200;" | CORVETTE
* [[#table10|What control interfaces will be necessary to enable runtime adaptation based on runtime performance measurements?]]
! style="width: 200;" | SLEEC
* [[#table11|There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?]]
! style="width: 200;" | PIPER
* [[#table12|What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?]]
|- style="vertical-align:top;"
* [[#table13|List the performance challenges that you think next generation programming languages and models will face.]]
| What performance information would an application developer need to know about software developed in your X-Stack project?
 
|
<div id="table1"></div>{{
|
PerformanceToolsTable
|
|question=What abstractions does your runtime stack use for parallelism?
|
 
|
|xpressanswer=The XPRESS project identifies three levels of parallelism associated with the ParalleX execution model:
|
a. The coarse grain parallelism is the ParalleX Process that provides context for other child processes and other forms of computation and may span multiple localities (i.e., nodes), b. The medium grain ParalleX compute complex (e.g., thread instantiation) that runs on a single locality and is a partially ordered set of operations, and c. The fine grain operations that are coordinated by a static dataflow graph (DAG).
|
 
|
 
|
|tganswer=EDTs and data blocks.
|
 
|- style="vertical-align:top;"
|degasanswer=
| What performance information would a code developer in your project need to know about your software?
 
|
|dtecanswer=
|
 
|
|dynaxanswer=The basic unit of computation is the SWARM codelet. Codelets are grouped into SCALE procedures that allow the sharing of data across the codelets. On top of codelets and procedures, two higher level notations have been implemented: The Hierarchically Tiled Array object which includes data parallel operations and is implemented as a library, and Sequential C code auto-parallelized and translated by R-Stream into codelets.
|
 
|
|xtuneanswer=
|
 
|
|gvranswer=
|
 
|
|corvetteanswer=
|
 
|- style="vertical-align:top;"
|sleecanswer=
| What information should a performance tool gather from each level in your software stack? What performance information should each level of your software stack maintain for inspection by performance tools? How would a tool get this information: register a signal handler, call an async-signal safe function from a signal handler, register a callback, call a function from a registered callback, or something else?
 
|
}}
|
 
|
<div id="table2"></div>{{
|
PerformanceToolsTable
|
|question=What performance information would application developers need to know to tune codes that use your X-Stack project's software?
|
 
|
|xpressanswer=The critical questions concern task granularity, which is driven by the overhead of managing threads, and the relative locality of execution and data objects; both can be addressed in part by compiler and runtime functions.
|
 
|
|tganswer=With the goal being separation of concerns, no platform-specific information needs to be known by the application developer. Developers need to provide hints that describe the software, using appropriate runtime APIs, which the runtime uses to aid in appropriate resource management.
|
 
|- style="vertical-align:top;"
|degasanswer=
| What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Serialization? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency? I/O characterization
 
|
|dtecanswer=
|
 
|
|dynaxanswer=Cost of memory accesses across the different levels of the hierarchy, overhead associated with codelet initiation and coordination, scheduling strategies.
|
 
|
|xtuneanswer=
|
 
|  
|gvranswer=
|
 
|
|corvetteanswer=
|
 
|- style="vertical-align:top;"
|sleecanswer=
| What kinds of performance problems do you foresee analyzing using post-mortem analysis?
 
|
}}
|
 
|
<div id="table3"></div>{{
|
PerformanceToolsTable
|
|question=What would a systems software developer need to know to tune the performance of your software stack?
|
 
|
|xpressanswer=The critical questions concern task granularity, which is driven by the overhead of managing threads, and the relative locality of execution and data objects; both can be addressed in part by compiler and runtime functions.
|
 
|
|tganswer=The runtime exposes resource management modules (introspection, allocator, scheduler) using well-defined internal interfaces that can be replaced or tweaked by the systems software developer to target the underlying platform.
|
 
|- style="vertical-align:top;"
|degasanswer=
| What kinds of performance problems do you foresee analyzing using on-the-fly analysis? What interfaces will be needed to gather the necessary information? What control interfaces will be necessary to enable runtime adaptation based on performance measurements?
 
|
|dtecanswer=
|
 
|
|dynaxanswer=Cost of communication across different levels of the hierarchy, cost of context switching, size of different memory levels, classes of processors, parameters of the scheduling strategy.
|
 
|
|xtuneanswer=
|
 
|
|gvranswer=
|
 
|
|corvetteanswer=
|
 
|- style="vertical-align:top;"
|sleecanswer=
| There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?
 
|
}}
|
 
|
<div id="table4"></div>{{
|
PerformanceToolsTable
|
|question=What information should a performance tool gather from each level in your software stack?
|
 
|  
|xpressanswer=Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
|
 
|
|tganswer=At application layer - application profiling, at runtime - resource management decisions and runtime overheads, and at simulation - detailed resource usage including monitoring exposed by hardware.
|
 
|- style="vertical-align:top;"
|degasanswer=
| What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?
 
|
|dtecanswer=
|
 
|
|dynaxanswer=Within the codelet: (a) Sequential performance of each codelet (e.g., in Gigaflops), and (b) Cost associated with memory accesses by instructions in a codelet (e.g., how many accesses are cache hits, how many go to scratch pad memory, and how many go to remote locations). Across codelets: (a) Overhead caused by both local and remote codelet initiation, cost of argument boxing, data communication costs (moving data), (b) Processor utilization and impact of the scheduling choices of the runtime stack, and (c) Other overhead of the runtime system.
|
 
|
|xtuneanswer=
|
 
|  
|gvranswer=
|
 
|
|corvetteanswer=
|
 
|- style="vertical-align:top;"
|sleecanswer=
| List the performance challenges that you think next generation programming languages and models will face.
 
|
|piperanswer=
|
}}
|
 
|
<div id="table5"></div>{{
|
PerformanceToolsTable
|
|question=What performance information can/does each level of your software stack maintain for inspection by performance tools?
|  
 
|
|xpressanswer=Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
|
 
|
|tganswer=Please see above.
|}
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table6"></div>{{
PerformanceToolsTable
|question=What information would your software stack need to maintain in order to measure per-thread or per-task performance? Can this information be accessed safely from a signal handler? Could a performance tool register its own tasks to monitor the performance of the runtime?
 
|xpressanswer=The APEX component of the HPX runtime system maintains the necessary information to measure per-thread performance including start, stop, suspend/pending event times, and ops counts. A performance tool can register its own tasks to monitor the performance of the runtime. Threads are first class objects and can be directly accessed by other threads.
 
|tganswer=Currently, the runtime maintains this information at varying degrees of granularity, depending on the developers' choice (from instruction and byte counts all the way to task statistics), and this information is available for offline analysis. Future work will allow a portion of this analysis to be made online, so that custom performance tool tasks are accommodated.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table7"></div>{{
PerformanceToolsTable
|question=What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency?
 
|xpressanswer=All of the above and more.
 
|tganswer=All the above mentioned, with the exception of I/O. In addition to these - runtime overheads at module-level granularity, memory use at different levels of the hierarchy, temperature & reaction time, DVFS & its effects.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=Yes. All of the above.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table8"></div>{{
PerformanceToolsTable
|question=What kinds of performance problems do you foresee analyzing using post-mortem analysis?
 
|xpressanswer=Post-mortem information would be useful to analyze non-causal behavioral data that cannot be predicted prior to execution. The analysis must also differentiate information that is entirely data dependent, and therefore likely to change, from that which is an intrinsic property of the program. A determination of the critical path and side path tasks combined with energy and time consumption requirements for each task would be very useful.
 
|tganswer=The primary problems diagnosed this way will be resource management decisions, and whether hints supplied by the program/compiler are internalized in decision making correctly. Additionally, runtime overheads will also be tracked closely.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=To find a balance between performance and power consumption. To achieve good resiliency for the system and to understand the impact of faults on performance.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table9"></div>{{
PerformanceToolsTable
|question=What kinds of performance problems do you foresee analyzing using runtime analysis? What interfaces will be needed to gather the necessary information?
 
|xpressanswer=The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid system jamming, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
 
|tganswer=DVFS decisions made by the runtime, and their impacts, will be analyzed using runtime analysis.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=To identify hardware/system/runtime bottlenecks, and to identify poor prioritization of the application's critical path. It is unclear what interfaces would provide sufficient introspection into the necessary systems without paying an unacceptable performance penalty; further discussion on this topic is welcome.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table10"></div>{{
PerformanceToolsTable
|question=What control interfaces will be necessary to enable runtime adaptation based on runtime performance measurements?
 
|xpressanswer=The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid system jamming, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
 
|tganswer=This is ongoing work, with scalability being the emphasis (since the metrics will produce huge volumes of data that will be hard to manage). Currently, statistical properties of metrics are being considered for use as proxies for various underlying causes. The interfaces are predominantly those exposed by hardware to the runtime (via counters), and minimal interfaces provided to resource management modules by the runtime.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=Execution time, energy consumption, cache miss ratio, fraction of accesses to each class of remote memory, latency of memory accesses, network collisions.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table11"></div>{{
PerformanceToolsTable
|question=There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?
 
|xpressanswer=For purposes of performance portability, the principal information required from the programmer level is parallelism and some relative locality information. It is possible that some higher-level idiomatic patterns of control and access may be useful, but these have yet to be determined.
 
|tganswer=The runtime provides implementation-level performance information at the runtime API level. Source transformations from the high-level application to the runtime API would also need to provide mechanisms to reverse-map the runtime-provided information from the implementation level back to the high-level application. Currently, the implementation-level details can still be mapped back to the application design with a basic level of familiarity with the transformation tools.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=In the case of HTA, the cost of each HTA operation decomposed into computation (sequential and parallel), communication cost, locality (cache misses or equivalent if scratch pad memories are used), network congestion, energy consumption.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
<div id="table12"></div>{{
PerformanceToolsTable
|question=What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?
 
|xpressanswer=Visualization of resource usage and pending (bottlenecked) work will help to inform about intrinsic code parallelism and precedence constraints.
 
|tganswer=Some high level transformation tools already have a graphical representation of the program abstractions. Additionally, we also have a graphical representation of data movement and energy consumption at the simulator level. These will be enhanced to accommodate other performance metrics currently being tracked. IDE integration has not been a focus so far, but will be considered once the toolchain attains maturity.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=A system like Intel's Parallel Studio and Thread Profiler.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
 
}}
 
<div id="table13"></div>{{
PerformanceToolsTable
|question=List the performance challenges that you think next generation programming languages and models will face.
 
|xpressanswer=Overhead and its impact on granularity, diversity of forms and scales of parallelism, parallelism discovery from metadata, energy suppression.
 
|tganswer=Based on our choice of the EDT (Event Driven Tasks) model, a primary challenge will be to ensure that the resource management overheads do not nullify the gains from the extra parallelism the model enables. We plan to address this by settling on the right granularity of task length and data block sizes so that the overheads are kept low, and the right balance between parallelism and management overheads is struck.
 
|degasanswer=
 
|dtecanswer=
 
|dynaxanswer=Tuning for a complex target environment where power, small memory sizes, and redundancy for reliability are issues. Exposing the important factors for parallelism, locality, communication, power, and reliability in a machine independent manner.
 
|xtuneanswer=
 
|gvranswer=
 
|corvetteanswer=
 
|sleecanswer=
 
}}
 
Note: PIPER is not listed as a column above because it is intended as a recipient of this information.

Latest revision as of 05:50, May 20, 2014

Questions:

What abstractions does your runtime stack use for parallelism?
XPRESS The XPRESS project identifies three levels of parallelism associated with the ParalleX execution model:

a. The coarse grain parallelism is the ParalleX Process that provides context for other child processes and other forms of computation and may span multiple localities (i.e., nodes), b. The medium grain ParalleX compute complex (e.g., thread instantiation) that runs on a single locality and is a partially ordered set of operations, and c. The fine grain operations that are coordinated by a static dataflow graph (DAG).

TG EDTs and data blocks.
DEGAS
D-TEC
DynAX The basic unit of computation is the SWARM codelet. Codelets are grouped into SCALE procedures that allow the sharing of data across the codelets. On top of codelets and procedures, two higher level notations have been implemented: The Hierarchically Tiled Array object which includes data parallel operations and is implemented as a library, and Sequential C code auto-parallelized and translated by R-Stream into codelets.
X-TUNE
GVR
CORVETTE
SLEEC
What performance information would application developers need to know to tune codes that use your X-Stack project's software?
XPRESS The critical questions concern task granularity, which is driven by the overhead of managing threads, and the relative locality of execution and data objects; both can be addressed in part by compiler and runtime functions.
TG With the goal being separation of concerns, no platform-specific information needs to be known by the application developer. Developers need to provide hints that describe the software, using appropriate runtime APIs, which the runtime uses to aid in appropriate resource management.
DEGAS
D-TEC
DynAX Cost of memory accesses across the different levels of the hierarchy, overhead associated with codelet initiation and coordination, scheduling strategies.
X-TUNE
GVR
CORVETTE
SLEEC
What would a systems software developer need to know to tune the performance of your software stack?
XPRESS The critical questions concern task granularity, which is driven by the overhead of managing threads, and the relative locality of execution and data objects; both can be addressed in part by compiler and runtime functions.
TG The runtime exposes resource management modules (introspection, allocator, scheduler) using well-defined internal interfaces that can be replaced or tweaked by the systems software developer to target the underlying platform.
DEGAS
D-TEC
DynAX Cost of communication across different levels of the hierarchy, cost of context switching, size of different memory levels, classes of processors, parameters of the scheduling strategy.
X-TUNE
GVR
CORVETTE
SLEEC
What information should a performance tool gather from each level in your software stack?
XPRESS Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
TG At application layer - application profiling, at runtime - resource management decisions and runtime overheads, and at simulation - detailed resource usage including monitoring exposed by hardware.
DEGAS
D-TEC
DynAX Within the codelet: (a) Sequential performance of each codelet (e.g., in Gigaflops), and (b) Cost associated with memory accesses by instructions in a codelet (e.g., how many accesses are cache hits, how many go to scratch pad memory, and how many go to remote locations). Across codelets: (a) Overhead caused by both local and remote codelet initiation, cost of argument boxing, data communication costs (moving data), (b) Processor utilization and impact of the scheduling choices of the runtime stack, and (c) Other overhead of the runtime system.
X-TUNE
GVR
CORVETTE
SLEEC
What performance information can/does each level of your software stack maintain for inspection by performance tools?
XPRESS Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
TG Please see above.
DEGAS
D-TEC
DynAX Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
X-TUNE
GVR
CORVETTE
SLEEC
What information would your software stack need to maintain in order to measure per-thread or per-task performance? Can this information be accessed safely from a signal handler? Could a performance tool register its own tasks to monitor the performance of the runtime?
XPRESS The APEX component of the HPX runtime system maintains the necessary information to measure per-thread performance including start, stop, suspend/pending event times, and ops counts. A performance tool can register its own tasks to monitor the performance of the runtime. Threads are first class objects and can be directly accessed by other threads.
TG Currently, the runtime maintains this information at varying degrees of granularity, depending on the developers' choice (from instruction and byte counts all the way to task statistics), and this information is available for offline analysis. Future work will allow a portion of this analysis to be made online, so that custom performance tool tasks are accommodated.
DEGAS
D-TEC
DynAX Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
X-TUNE
GVR
CORVETTE
SLEEC
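
The per-task fields that DynAX lists above (initiation time, termination time, number of codelets initiated, total size of parameters, number of reinitiations) can be pictured as a small record that a runtime fills in around each task. The following is only a minimal sketch in Python: `TaskRecord` and `run_instrumented` are hypothetical names, not part of SWARM, HPX/APEX, or any other X-Stack runtime, and cache miss rate is omitted because it would require hardware counters.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are illustrative only."""
    name: str
    initiation_time: Optional[float] = None
    termination_time: Optional[float] = None
    children_initiated: int = 0   # number of tasks this task started
    parameter_bytes: int = 0      # total size of byte-like parameters
    reinitiations: int = 0        # restarts due to failures

    def elapsed(self) -> float:
        """Wall-clock duration of the task, once it has finished."""
        if self.initiation_time is None or self.termination_time is None:
            raise ValueError("task has not finished")
        return self.termination_time - self.initiation_time


def run_instrumented(record, fn, *args, clock=time.perf_counter):
    """Run fn(*args) and fill in the timing fields of the record."""
    record.initiation_time = clock()
    # Only bytes-like arguments are counted in this sketch.
    record.parameter_bytes += sum(
        len(a) for a in args if isinstance(a, (bytes, bytearray))
    )
    try:
        return fn(*args)
    finally:
        # Recorded even if the task raises, so failed tasks still get a duration.
        record.termination_time = clock()
```

A tool could collect such records after the run for offline analysis, or read them during the run if the runtime exposes them; which of the two is possible depends on the runtime, as the answers above indicate.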
What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency?
XPRESS All of the above and more.
TG All the above mentioned, with the exception of I/O. In addition to these - runtime overheads at module-level granularity, memory use at different levels of the hierarchy, temperature & reaction time, DVFS & its effects.
DEGAS
D-TEC
DynAX Yes. All of the above.
X-TUNE
GVR
CORVETTE
SLEEC
What kinds of performance problems do you foresee analyzing using post-mortem analysis?
XPRESS Post-mortem information would be useful to analyze non-causal behavioral data that cannot be predicted prior to execution. The analysis must also differentiate information that is entirely data dependent, and therefore likely to change, from that which is an intrinsic property of the program. A determination of the critical path and side path tasks combined with energy and time consumption requirements for each task would be very useful.
TG The primary problems diagnosed this way will be resource management decisions, and whether hints supplied by the program/compiler are internalized in decision making correctly. Additionally, runtime overheads will also be tracked closely.
DEGAS
D-TEC
DynAX To find a balance between performance and power consumption. To achieve good resiliency for the system and to understand the impact of faults on performance.
X-TUNE
GVR
CORVETTE
SLEEC
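
The XPRESS answer above asks for a determination of critical path and side path tasks from post-mortem data. As a rough illustration, assuming the trace yields a duration per task and the set of tasks each one waited on, the critical path is the longest path through the task graph. The function name `critical_path` and the input layout are assumptions made for this sketch, not an interface of any project listed here; it uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (length, path) of the longest path through a task DAG.

    durations: dict mapping task -> measured time
    deps:      dict mapping task -> iterable of tasks it depends on
    Every task named in deps is assumed to appear in durations.
    """
    graph = {task: set(deps.get(task, ())) for task in durations}
    finish = {}      # earliest finish time of each task
    best_pred = {}   # predecessor that determines that finish time
    # static_order() yields each task after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        prev = max(graph[task], key=lambda p: finish[p], default=None)
        best_pred[task] = prev
        start = finish[prev] if prev is not None else 0.0
        finish[task] = start + durations[task]
    end = max(finish, key=finish.get)
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = best_pred[node]
    path.reverse()
    return finish[end], path
```

Tasks that do not appear in the returned path are the side-path tasks; per-task energy figures, where available, could be summed along the same path.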
What kinds of performance problems do you foresee analyzing using runtime analysis? What interfaces will be needed to gather the necessary information?
XPRESS The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid system jamming, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
TG DVFS decisions made by the runtime, and their impacts, will be analyzed using runtime analysis.
DEGAS
D-TEC
DynAX To identify hardware/system/runtime bottlenecks, and to identify poor prioritization of the application's critical path. It is unclear what interfaces would provide sufficient introspection into the necessary systems without paying an unacceptable performance penalty; further discussion on this topic is welcome.
X-TUNE
GVR
CORVETTE
SLEEC
What control interfaces will be necessary to enable runtime adaptation based on runtime performance measurements?
XPRESS The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid system jamming, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
TG This is ongoing work, with scalability being the emphasis (since the metrics will produce huge volumes of data that will be hard to manage). Currently, statistical properties of metrics are being considered for use as proxies for various underlying causes. The interfaces are predominantly those exposed by hardware to the runtime (via counters), and minimal interfaces provided to resource management modules by the runtime.
DEGAS
D-TEC
DynAX Execution time, energy consumption, cache miss ratio, fraction of accesses to each class of remote memory, latency of memory accesses, network collisions.
X-TUNE
GVR
CORVETTE
SLEEC
There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?
XPRESS For purposes of performance portability, the principal information required from the programmer level is parallelism and some relative locality information. It is possible that some higher-level idiomatic patterns of control and access may be useful, but these have yet to be determined.
TG The runtime provides implementation-level performance information at the runtime API level. Source transformations from the high-level application to the runtime API would also need to provide mechanisms to reverse-map the runtime-provided information from the implementation level back to the high-level application. Currently, the implementation-level details can still be mapped back to the application design with a basic level of familiarity with the transformation tools.
DEGAS
D-TEC
DynAX In the case of HTA, the cost of each HTA operation decomposed into computation (sequential and parallel), communication cost, locality (cache misses or equivalent if scratch pad memories are used), network congestion, energy consumption.
X-TUNE
GVR
CORVETTE
SLEEC
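
The reverse mapping that the TG answer describes can be sketched as a table, emitted by the source transformation, that maps each runtime-level task to the application-level construct it came from; measurements are then summed per construct. Everything in this sketch is hypothetical: the function `attribute_to_source`, the task identifiers, and the source labels are placeholders rather than output of any real toolchain.

```python
from collections import defaultdict


def attribute_to_source(samples, reverse_map, unknown="<unattributed>"):
    """Aggregate per-task measurements by application-level construct.

    samples:     iterable of (task_id, seconds) pairs from the runtime
    reverse_map: dict mapping task_id -> application-level label,
                 assumed to be produced by the source transformation
    """
    totals = defaultdict(float)
    for task_id, seconds in samples:
        # Tasks with no mapping (e.g. runtime-internal work) are kept
        # visible under a separate label instead of being dropped.
        totals[reverse_map.get(task_id, unknown)] += seconds
    return dict(totals)
```

Keeping an explicit bucket for unmapped tasks makes runtime overhead visible alongside the application-level view instead of hiding it.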
What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?
XPRESS Visualization of resource usage and pending (bottlenecked) work will help to inform about intrinsic code parallelism and precedence constraints.
TG Some high level transformation tools already have a graphical representation of the program abstractions. Additionally, we also have a graphical representation of data movement and energy consumption at the simulator level. These will be enhanced to accommodate other performance metrics currently being tracked. IDE integration has not been a focus so far, but will be considered once the toolchain attains maturity.
DEGAS
D-TEC
DynAX A system like Intel's Parallel Studio and Thread Profiler.
X-TUNE
GVR
CORVETTE
SLEEC
List the performance challenges that you think next generation programming languages and models will face.
XPRESS Overhead and its impact on granularity, diversity of forms and scales of parallelism, parallelism discovery from metadata, energy suppression.
TG Based on our choice of the EDT (Event Driven Tasks) model, a primary challenge will be to ensure that the resource management overheads do not nullify the gains from the extra parallelism the model enables. We plan to address this by settling on the right granularity of task length and data block sizes so that the overheads are kept low, and the right balance between parallelism and management overheads is struck.
DEGAS
D-TEC
DynAX Tuning for a complex target environment where power, small memory sizes, and redundancy for reliability are issues. Exposing the important factors for parallelism, locality, communication, power, and reliability in a machine independent manner.
X-TUNE
GVR
CORVETTE
SLEEC
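
The trade-off between task granularity and management overhead raised by XPRESS and TG above can be illustrated with a deliberately simple cost model. The model is an assumption made for this sketch, not something either project states: work is split into equal tasks of size `grain`, each task pays a fixed `overhead`, and tasks run in synchronized rounds across `workers`.

```python
import math


def estimated_makespan(total_work, grain, overhead, workers):
    """Estimated run time under the simplified round-based model."""
    tasks = math.ceil(total_work / grain)   # number of tasks created
    rounds = math.ceil(tasks / workers)     # rounds needed to run them all
    # Every round costs one task's work plus its management overhead.
    return rounds * (grain + overhead)


def best_grain(total_work, overhead, workers, candidates):
    """Pick the candidate grain size with the lowest estimated makespan."""
    return min(
        candidates,
        key=lambda g: estimated_makespan(total_work, g, overhead, workers),
    )
```

With 1000 units of work, an overhead of 5 per task, and 4 workers, a grain of 1 gives an estimate of 1500 because overhead dominates, while a single task of 1000 gives 1005 because there is no parallelism; a grain of 250 gives 255 under this model. Real runtimes add load imbalance and data movement costs that the sketch ignores.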

Note: PIPER is not listed as a column above because it is intended as a recipient of this information.