Correctness Tools

Questions:

What kind of runtime overhead can you accept when running your application with a dynamic analysis tool such as a data race detector? 1.1X, 1.5X, 2X, 5X, 10X, 100X.
What types of bugs do you want correctness tools to detect? Data races, deadlocks, non-determinism.
What kind of correctness tools can help you with interactive debugging in a debugger? Do you want those tools to discover bugs for you?
Current auto-tuning and performance/precision debugging tools treat the program as a black-box. What kind of white-box program analysis techniques can help in auto-tuning your application with floating-point precision and alternative algorithms?
What kind of testing strategy do you normally apply while developing your software component? Do you write tests before or after the development process? What kind of coverage do you try to achieve? Do you write small unit tests or large systems tests to achieve coverage? Are your tests non-deterministic?
After a bug is discovered, do you want automated support to create a simplified test that will reproduce the bug?
What is your strategy to debug a DSL and its runtime? What kind of multi-level debugging support do you want for DSLs?
What kind of visualization and presentation support for bugs do you want from correctness tools? What kind of IDE integration do you want for the correctness tools?
When combining various languages/runtimes, can your language/runtime be debugged in isolation from the rest of the system?
List the testing and debugging challenges that you think the next generation programming languages and models would face?
How can correctness tools help with reasoning about energy? How can correctness tools help with resilience?

What kind of runtime overhead can you accept when running your application with a dynamic analysis tool such as a data race detector? 1.1X, 1.5X, 2X, 5X, 10X, 100X.
XPRESS	Runtime overhead determines task granularity and therefore available parallelism (strong scaling) determining time to solution. Only fractional overheads can be tolerated at runtime.
TG	That depends on how much of the application, and which parts, can be excluded from analysis during the run because that portion is trusted. If the problem is already isolated to a particular section, then a very significant performance hit is acceptable to pinpoint it. If a single loop nest has already been identified and that nest normally runs for a few seconds, then even 100X would be acceptable, provided it did not delay turnaround by more than a few minutes.
DEGAS
D-TEC	In debug mode 10X or even 100X for selected small input data sets. We consider running the app with dynamic analysis tools as part of testing. In release mode we do not expect to run with dynamic analysis tools.
DynAX	This should really be a question for the DOE application writers. I imagine that, in debug mode, performance degradations of up to a few X are acceptable. But most likely, if a dynamic tool significantly changes the program's execution time, it also changes the way the program executes (dynamic task schedules, non-determinism) and hence may analyze irrelevant behavior.
X-TUNE	Presumably, autotuning would only be applied when ready to run software in production mode. I suspect correctness software would only be used if the tuning process had some error, in which case some overhead would be tolerable.
GVR
CORVETTE
SLEEC
PIPER	N/A

What types of bugs do you want correctness tools to detect? Data races, deadlocks, non-determinism.
XPRESS	Hardware errors, runtime errors, and discrepancies with respect to expectations (e.g., task time to completion, satisfying a condition) are among the classes of bugs that detection mechanisms would address. These probably have to be built into the runtime/compile-time system and be consistent with the overall execution model.
TG	Look for cycles in EDT graphs; congestion hotspots and livelocks.
DEGAS
D-TEC	It is important to distinguish when an error is detected: at compile time by static analysis; on the fly at run time, which requires less storage than post-mortem detection because much information can be discarded as execution progresses; or after execution by analyzing traces. Non-determinism can be the result of data races, so detecting data races is most important, as it may help identify sources of deadlocks and non-determinism. One might also run analyses at several of these points in time and combine the results. We consider the approaches complementary, since they have different trade-offs in performance impact and memory consumption.
DynAX	SWARM proposes mechanisms (such as tags) to help avoid data races. However, SWARM is compatible with traditional synchronization and data access mechanisms, and hence the programmer can create data races. Deadlocks can appear with any programming model based on task graphs, including the one supported by the SWARM runtime. They happen as soon as a cycle is created among tasks. ETI's tracer can help detect deadlocks by showing which tasks were running when the deadlock happened. Non-determinism is often a feature. It may be desired in some parts of a program (for instance, when running a parallel reduction) and not in others. So a tool for detecting non-determinism per se would not be sufficient; it would require an API to specify where non-determinism is unexpected.
X-TUNE	We would want to identify errors introduced by the tuning process, so this could be any kind of error.
GVR
CORVETTE
SLEEC
PIPER	N/A
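
The TG and DynAX responses above both tie deadlocks to cycles among tasks. A minimal sketch of such a check follows, assuming a hypothetical representation in which each task name maps to the list of tasks it waits on. It does not model the actual EDT graph format or the SWARM runtime API; the function name `find_cycle` and the example graph are illustrative only.

```python
def find_cycle(waits_on):
    """Return one dependency cycle as a list of tasks, or None if there is none.

    waits_on: dict mapping a task name to the tasks it waits on
    (a hypothetical format chosen for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}
    path = []

    def visit(task):
        color[task] = GRAY
        path.append(task)
        for dep in waits_on.get(task, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path from dep
                # back to dep forms a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[task] = BLACK
        return None

    for task in list(waits_on):
        if color.get(task, WHITE) == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


# Hypothetical task graph: A waits on B, B on C, C on A; D waits on A.
graph = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
print(find_cycle(graph))  # ['A', 'B', 'C', 'A']
```

The recursive depth-first search is adequate for small graphs; a tool operating on production-scale task graphs would presumably need an iterative or incremental variant.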

What kind of correctness tools can help you with interactive debugging in a debugger? Do you want those tools to discover bugs for you?
XPRESS	Beyond a certain level of scaling, debugging is almost indistinguishable from fault tolerance methods. Detection of errors (not found at compile time) requires equivalent mechanisms, diagnosis needs to add code as a possible source of error, but an additional kind of user interface is required, perhaps to provide a patch at runtime permitting continued execution from point of error. This is supported under DOE OS/R Hobbes project.
TG	Yes that would be ideal, but this is a tall request even in current debuggers on current systems.
DEGAS
D-TEC	A tool that can identify which assertions it cannot statically verify, applied in combination with a slicing tool that computes a backward slice from a given set of concrete variable values provided by the debugger.
DynAX	Some bugs can be discovered automatically, such as deadlocks and livelocks. For the others, tools need to reduce the time required to find the source of the bug.
X-TUNE	The most interesting tool would be one that could compare two different versions of the code to see where changes to variable values are observed.
GVR
CORVETTE
SLEEC
PIPER	N/A
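
The X-TUNE response asks for a tool that compares two versions of a code and reports where variable values start to differ. A rough sketch of that idea is given below, assuming both versions can be instrumented to append `(name, value)` pairs to a trace in the same order. The trace format, the function names, and the deliberately buggy "tuned" variant are all hypothetical and exist only for illustration.

```python
import math


def first_divergence(trace_a, trace_b, rel_tol=1e-9):
    """Return (index, name, value_a, value_b) for the first differing entry.

    Both traces are lists of (name, value) pairs and are assumed to have the
    same length. Returns None if no entry differs beyond the tolerance.
    """
    for i, ((name_a, val_a), (name_b, val_b)) in enumerate(zip(trace_a, trace_b)):
        if name_a != name_b or not math.isclose(val_a, val_b, rel_tol=rel_tol):
            return i, name_a, val_a, val_b
    return None


def sum_of_squares_reference(xs, trace):
    total = 0.0
    for i, x in enumerate(xs):
        total += x * x
        trace.append((f"total@{i}", total))
    return total


def sum_of_squares_tuned(xs, trace):
    # Deliberately incorrect variant: x * abs(x) equals x * x only for x >= 0.
    total = 0.0
    for i, x in enumerate(xs):
        total += x * abs(x)
        trace.append((f"total@{i}", total))
    return total


data = [1.0, 2.0, -3.0, 4.0]
ref_trace, tuned_trace = [], []
sum_of_squares_reference(data, ref_trace)
sum_of_squares_tuned(data, tuned_trace)
print(first_divergence(ref_trace, tuned_trace))  # (2, 'total@2', 14.0, -4.0)
```

The first differing entry points at the iteration that handles the negative input, which is where the two versions stop agreeing.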

Current auto-tuning and performance/precision debugging tools treat the program as a black-box. What kind of white-box program analysis techniques can help in auto-tuning your application with floating-point precision and alternative algorithms?
XPRESS	XPRESS does not address floating-point precision issues and will benefit from other programs in possible solutions. XPRESS does incorporate the APEX and RCR components for introspective performance optimization at runtime system controlling load balancing and task scheduling. It measures progress towards goals to adjust ordering, especially for critical path tasks.
TG	I would rephrase the question a bit. Intel's current production tools and the open source tools in the community are actually capable of tracing possible problems back to source code lines and making suggestions. However, they give few hints about performance problems that emanate from runtime or system issues. Such hints would be desirable.
DEGAS
D-TEC	Precision analysis that can verify assertions at different points in a program that specify expected precision and ranges of values of variables, subsets, or all values of an array at a point in execution of a program.
DynAX	Compilers can instrument the code to let auto-tuners focus on particular parts of the code and on particular execution characteristics (e.g., by selecting hardware counters). Please define "alternative algorithms".
X-TUNE	The key issue will be understanding when differences in output are acceptable, and when they represent an error.
GVR
CORVETTE
SLEEC
PIPER	N/A
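
The D-TEC response describes assertions on expected precision and value ranges at given program points. The sketch below is a runtime stand-in for that idea, not the analysis D-TEC describes (which may verify such assertions rather than merely test them). The helper names `assert_range` and `assert_precision`, the tolerance, and the input data are assumptions made for illustration; binary32 arithmetic is emulated with the standard `struct` module.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def assert_range(name, values, lo, hi):
    """Check that every element lies in the closed interval [lo, hi]."""
    for i, v in enumerate(values):
        assert lo <= v <= hi, f"{name}[{i}]={v} outside [{lo}, {hi}]"


def assert_precision(name, low_precision, reference, rel_tol):
    """Check the relative error of a low-precision result against a reference."""
    err = abs(low_precision - reference) / max(abs(reference), 1e-300)
    assert err <= rel_tol, f"{name}: relative error {err:.3e} exceeds {rel_tol:.1e}"
    return err


def dot_f64(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_f32(a, b):
    # Emulate single precision by rounding inputs and every intermediate result.
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc


a = [0.1 * i for i in range(100)]
b = [1.0 / (i + 1) for i in range(100)]

assert_range("a", a, 0.0, 10.0)
assert_range("b", b, 0.0, 1.0)
err = assert_precision("dot", dot_f32(a, b), dot_f64(a, b), rel_tol=1e-5)
print(f"relative error of single-precision dot product: {err:.3e}")
```

Because all terms are non-negative, the single-precision error stays well within the stated tolerance; an auto-tuner could use checks of this kind to decide whether a lower-precision variant is admissible.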

What kind of testing strategy do you normally apply while developing your software component? Do you write tests before or after the development process? What kind of coverage do you try to achieve? Do you write small unit tests or large systems tests to achieve coverage? Are your tests non-deterministic?
XPRESS	Tests will be written after the code and incorporated into the Microcheckpointing Compute-Validate-Commit cycle for fault tolerance and debugging. Tests will be hierarchical for better, but still incomplete, coverage. Phased checkpointing will provide a very coarse-grained fallback in case of unrecoverable errors.
TG	Ad hoc blend of unit, group, application tests.
DEGAS
D-TEC	Writing tests is part of the development process. System and unit tests are run as part of an incremental integration process. Any bugs found are distilled into test cases that become part of the integration tests.
DynAX	Unit tests are written for each piece of software, and massive system tests are run every day. Tests support some amount of non-determinism, which results in bounded numerical variations in the results. Test coverage reflects expected use: often-used components get tested more intensively. System tests are often available before the development process, while unit tests are usually written during and after each code contribution.
X-TUNE	Comparing output between a version that is believed correct and an optimized version is the standard approach.
GVR
CORVETTE
SLEEC
PIPER	tbd.
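
Two of the responses above, comparing a trusted version against an optimized one (X-TUNE) and accepting bounded numerical variation (DynAX), can be combined in a tolerance-based output check. The following is a minimal sketch; the function names, the tolerance value, and the three result vectors are assumptions chosen for illustration.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-12, abs_tol=0.0):
    """True if both output vectors agree element-wise within the tolerances."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def sum_left_to_right(xs):
    # Plain sequential accumulation, standing in for the trusted version.
    total = 0.0
    for x in xs:
        total += x
    return total


values = [0.1] * 10
trusted = [sum_left_to_right(values)]      # sequential rounding at every step
tuned = [math.fsum(values)]                # different summation algorithm
broken = [sum_left_to_right(values[:-1])]  # drops an element: a real error

print("bitwise equal:       ", trusted == tuned)
print("tuned within bound:  ", outputs_match(trusted, tuned))
print("broken within bound: ", outputs_match(trusted, broken))
```

Exact comparison rejects the tuned result even though it differs only by rounding, while the bounded comparison accepts it and still rejects the genuinely wrong result. Choosing the tolerance remains application-specific.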

After a bug is discovered, do you want automated support to create a simplified test that will reproduce the bug?
XPRESS	Yes
TG	Yes
DEGAS
D-TEC	Yes. We consider SMT solvers in this respect most promising.
DynAX	This would sound useful. The more simplified, the better.
X-TUNE	Yes
GVR
CORVETTE
SLEEC
PIPER	N/A
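
As an illustration of automated test simplification, the sketch below shrinks a failing input with a simplified delta-debugging-style reduction. This is a different technique from the SMT-based approach D-TEC mentions and is swapped in here only because it fits in a few lines. The `fails` predicate and the example bug (triggered when the input contains both 3 and 7) are hypothetical.

```python
def minimize(failing_input, fails):
    """Shrink a failing list input while fails(candidate) stays True.

    Removes chunks of decreasing size; the result is 1-minimal, meaning that
    removing any single remaining element no longer reproduces the failure.
    """
    assert fails(failing_input), "the starting input must reproduce the bug"
    current = list(failing_input)
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        i = 0
        removed_any = False
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if fails(candidate):
                current = candidate  # keep the smaller failing input
                removed_any = True
            else:
                i += chunk           # this chunk is needed; move on
        if not removed_any:
            chunk //= 2              # refine granularity
    return current


def fails(xs):
    # Hypothetical bug: triggered only when both 3 and 7 are present.
    return 3 in xs and 7 in xs


print(minimize(list(range(10)), fails))  # [3, 7]
```

Each call to `fails` stands for one run of the program under test, so the cost of the reduction is dominated by how expensive it is to reproduce the bug.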

What is your strategy to debug a DSL and its runtime? What kind of multi-level debugging support do you want for DSLs?
XPRESS	XPRESS provides the HPX runtime that will incorporate its own set of test and correctness mechanisms under the Microcheckpointing methodology using the Compute-Validate-Commit cycle and built in testing. It is expected that DSL will support the generation of reverse test cases to support this.
TG	Very similar strategy to debugging other codes. There would be nothing fundamentally different.
DEGAS
D-TEC	For DSLs that are translated and/or lowered to a general-purpose language, we expect assertions present in the DSL to be generated in the lower-level code as well, so that any failing assertion can be related to a point in the original DSL. We expect DSLs to allow domain-specific assertions in combination with user-provided domain-specific knowledge, in the form of properties that can then be checked in the generated lower-level code.
DynAX
X-TUNE	Do we trust the DSL to translate code correctly? That seems like the fundamental question. The DSL developer should be able to debug the translation, while the DSL user should just be debugging the code that they added.
GVR
CORVETTE
SLEEC
PIPER	N/A
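
The D-TEC response expects DSL-level assertions to survive lowering so that a failure can be related back to the original DSL. A toy sketch of that mechanism follows. The two-statement DSL (`let` and `assert`), the `lower` and `run` functions, and the example program are all invented for illustration and do not correspond to any project's DSL.

```python
def lower(dsl_source):
    """Translate the toy DSL to Python source.

    Each DSL assertion becomes a Python assert whose message records the DSL
    line number (counted from the first non-blank line) and the condition.
    """
    out = []
    for lineno, line in enumerate(dsl_source.strip().splitlines(), start=1):
        line = line.strip()
        if line.startswith("let "):
            out.append(line[len("let "):])
        elif line.startswith("assert "):
            cond = line[len("assert "):]
            message = f"DSL line {lineno}: {cond}"
            out.append(f"assert {cond}, {message!r}")
        elif line:
            raise SyntaxError(f"DSL line {lineno}: unknown statement {line!r}")
    return "\n".join(out)


def run(dsl_source):
    """Execute the lowered code; return the DSL-level failure message or None."""
    env = {}
    try:
        exec(lower(dsl_source), env)
    except AssertionError as err:
        return str(err)
    return None


program = """
let x = 4
assert x > 0
let y = x - 10
assert y > 0
"""

print(lower(program))
print(run(program))  # DSL line 4: y > 0
```

The failure is reported in terms of the DSL source, which matches the split X-TUNE describes: the DSL user debugs the code they wrote, while the DSL developer debugs the translation.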

What kind of visualization and presentation support for bugs do you want from correctness tools? What kind of IDE integration do you want for the correctness tools?
XPRESS	Visualization and presentation support to correlate a detected error in terms of physical location, point in code, virtual user thread instantiation, and exact instruction causing the error will be of great value, especially if combined with a control framework for manipulating the execution stream at the offending point for diagnosis and correction.
TG	Visualization can be quite simple. A useful example is the display used in Intel's performance tuning tools such as VTune. IDEs are a personal choice, as is the choice not to use one, but I would at a minimum select Eclipse for integration.
DEGAS
D-TEC	We want a query language for mining any collected data, together with scalable visualization of that data showing dependencies and attached properties. This could support debugging at multiple levels.
DynAX	Any tool that allows big-picture and detailed views, as in ETI's tracer for SWARM. Tools that don't integrate with a particular IDE end up being compatible with all IDEs, including vi and emacs, which is desirable.
X-TUNE	Pinpointing the code or data involved in a bug or bottleneck is helpful in debugging and performance tools.
GVR
CORVETTE
SLEEC
PIPER	Integration with performance analysis tools (similar domains to represent, don't want to burden users with multiple different tools)

When combining various languages/runtimes, can your language/runtime be debugged in isolation from the rest of the system?
XPRESS	Yes.
TG	In practice, but this is a goal. It could be called an aspect of separation of concerns.
DEGAS
D-TEC	This depends if the DSL is integrated into a general purpose language or a separate language that interfaces with the language run-time of the host language. In the latter case 'yes', otherwise 'no'.
DynAX	The entire execution stack is visible to the programmer at debug time, including the SWARM parts. We do not think that opacity across system parts would bring clarity in the debugging process.
X-TUNE	N/A -- We use standard languages and run-time support.
GVR
CORVETTE
SLEEC
PIPER	N/A

List the testing and debugging challenges that you think the next generation programming languages and models would face?
XPRESS	The principal challenges are detection and isolation.
TG	The challenges will necessarily be in providing information about data placement and data movement. This exists to a certain extent already in analysis for performance, but additional information about energy usage will be very valuable.
DEGAS
D-TEC	For the next generation of programming languages (and given the resiliency issues of next-generation exascale hardware), we expect that avoiding errors by verifying properties at compile time will become increasingly important, because resiliency concerns will make run-time debugging even more challenging.
DynAX	We expect the main challenges to be related to tractability of debugging and testing applications that run on millions of cores, and reproducibility of non-deterministic applications on a non-reliable hardware platform.
X-TUNE	Scalability and determining what is an error seem like the biggest challenges.
GVR
CORVETTE
SLEEC
PIPER	N/A

How can correctness tools help with reasoning about energy? How can correctness tools help with resilience?
XPRESS	With respect to energy, tools that can determine the critical path of execution would provide the basis for energy/power scheduling. As above for reliability, see 10) and other answers.
TG	Please see the note above. Related to resilience: simulations will not be deterministic. Tools that allow the developer to distinguish between differing results that stem from different operation ordering and results that indicate an actual failure would be useful.
DEGAS
D-TEC	Correctness tools can identify erroneous paths and can therefore help to identify and remove those paths, which otherwise could have a severe impact on reasoning about power. Any run-time violation of a property that a correctness tool has established at compile time to hold for a given execution path can be attributed to a resilience fault and can trigger appropriate recovery actions at run time.
DynAX	Correctness tools may be able to help the programmer make design choices which impact energy consumption, such as enabling non-determinism in parts of their application. Lightweight tools that detect faults are presumably directly applicable to resiliency, for instance by coupling them with checkpointing.
X-TUNE	Correctness tools can certainly help with resilience, if they have a concept of what is tolerable vs. an error. I don't see a connection to energy optimization.
GVR
CORVETTE
SLEEC
PIPER	Energy/power monitoring is a performance problem, in particular in power constrained scenarios

From Modelado Foundation