CORVETTE

CORVETTE


Team Members	UC Berkeley, LBNL
PI	Koushik Sen
Co-PIs	James W. Demmel (UC Berkeley), Costin Iancu (LBNL)
Website	CORVETTE
Download	{{{download}}}

Program Correctness, Verification, and Testing for Exascale or CORVETTE

Team Members

Researchers

Postdoctoral Researchers

Graduate Student Researcher

Cuong Nguyen

Motivation

High performance scientific computing
- Exascale: O(10⁶) nodes, O(10³) cores per node
- Requires asynchrony and “relaxed” memory consistency
- Shared memory with dynamic task parallelism
- Languages allow remote memory modification

Correctness challenges
- Non-deterministic causes hard to diagnose correctness and performance bugs
  - Data races, atomicity violations, deadlocks …
- Bugs in DSL
- Scientific applications use floating-points: non-determinism leads to non-reproducible results
- Numerical exceptions can cause rare but critical bugs that are hard for non-experts to detect and fix

Goals

Develop correctness tools for different programming models: PGAS, MPI, dynamic parallelism

I. Testing and Verification

Identify sources of non-determinism in executions
Data races, atomicity violations, non‐reproducible floating point results
Explore state-of-the-art techniques that use dynamic analysis
Develop precise and scalable tools: < 2X overhead

II. Debugging

Use minimal amount of concurrency to reproduce bug
Support two-level debugging of high-level abstractions
Detect causes of floating-point anomalies and determine the minimum precision needed to fix them

Impact Summary

Corvette Impact Summary

List of Products

Publications

Floating Point Precision Tuning Using Blame Analysis Cindy Rubio-Gonz ́alez, Cuong Nguyen, James Demmel, William Kahan, Koushik Sen, Wim Lavrijsen, Costin Iancu, LBNL TR, April 17, 2015.
OPR: Partial Deterministic Record and Replay for One-Sided Communication Xuehai Qian, Koushik Sen, Paul Hargrove, Costin Iancu, LBNL TR, April 17, 2015.
Barrier Elision for Production Parallel Programs Milind Chabbi, Wim Lavrijsen, Wibe de Jong, Koushik Sen, John Mellor Crummey, Costin Iancu, PPOPP 2015, February 5, 2015.
"Parallel Reproducible Summation James Demmel", Hong-Diep Nguyen, In IEEE Transactions on Computers, Special Section on Computer Arithmetic 2014, April 30, 2014
Precimonious: Tuning Assistant for Floating-Point Precision. C. Rubio-González, C. Nguyen, H.D. Nguyen, J. Demmel, W. Kahan, K. Sen, D.H. Bailey, C. Iancu, and D. Hough. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13), Denver, CO, November 2013.
Fast Reproducible Floating-Point Summation. J. Demmel and H.D. Nguyen. In Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), Austin, TX, April 2013.
Numerical Accuracy and Reproducibility at ExaScale. J. Demmel and H.D. Nguyen. In Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), Austin, TX, April 2013.
Scaling Data Race Detection for Partitioned Global Address Space Programs. C. Park, K. Sen, C. Iancu. In Proceedings of the International Conference on Supercomputing (ICS'13), Eugene, OR, June 2013.

Technical Presentations

06/2014. Testing, Debugging, and Precision-tuning of Large-scale Parallel and Floating-point Programs. Presenter: Koushik Sen. Invited R. Narasimhan Lecture Award, TIFR, India.
05/2014. Precision Tuning of Floating-Point Programs. Presenters: Cindy Rubio-González and Cuong Nguyen. Berkeley Programming Systems Retreat, Santa Cruz, CA.
04/2014. Improving Software Reliability and Performance Using Program Analysis. Presenter: Cindy Rubio-González. University of California, Irvine, Irvine, CA.
03/2014. Improving Software Reliability and Performance Using Program Analysis. Presenter: Cindy Rubio-González. University of Texas at Dallas, Dallas, TX.
03/2014. Improving Software Reliability and Performance Using Program Analysis. Presenter: Cindy Rubio-González. University of California, Davis, Davis, CA.
02/2014. Improving Software Reliability and Performance Using Program Analysis. Presenter: Cindy Rubio-González. The University of New York at Buffalo, Buffalo, NY.
02/2014. Improving Software Reliability and Performance Using Program Analysis. Presenter: Cindy Rubio-González. College of William and Mary, Williamsburg, VA.
02/2014. Improving Software Reliability and Performance Using Program Analysis. Presenter: Cindy Rubio-González. Invited talk at SRI International, Menlo Park, CA.
01/2014. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. UC Berkeley ASPIRE Winter Retreat, Tahoe, CA.
12/2013. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. Invited talk at Bay Area Scientific Computing Day (BASCD'13), LBNL, Berkeley, CA.
11/2013. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. Supercomputing Conference (SC'13), Denver, CO.
11/2013. ReproBLAS: Reproducible BLAS. Presenter: Hong Diep Nguyen. Supercomputing Conference (SC'13), Denver, CO.
11/2013. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. Massachusetts Institute of Technology, Cambridge, MA.
11/2013. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. Rising Stars in EECS Workshop, Poster Session, MIT, Cambridge, MA.
08/2013. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. Invited talk at Oracle, Compiler Technical Talks Series, Santa Clara, CA.
06/2013. Scaling Data Race Detection for Partitioned Global Address Space Programs. Presenter: Costin Iancu. International Supercomputing Conference (ICS'13), Eugene, OR.
06/2013. Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems. Presenter: Hong Diep Nguyen. SIAM Annual Meeting 2013 (AN13), San Diego, CA.
06/2013. Precimonious: Tuning Assistant for Floating-Point Precision. Presenter: Cindy Rubio-González. Lawrence Berkeley National Laboratory DEGAS Retreat, Santa Cruz, CA.
04/2013. Fast Reproducible Floating-Point Summation. Presenter: Hong Diep Nguyen. 21st IEEE Symposium on Computer Arithmetic (ARITH21), Austin, TX.
04/2013. Numerical Reproducibility and Accuracy at ExaScale. Presenter: Hong Diep Nguyen. 21st IEEE Symposium on Computer Arithmetic (ARITH21), Austin, TX.
11/2012. Reproducible Floating Point Computation: Motivation, Algorithms, Diagnostics. Presenter: James Demmel. Birds of a Feather (BOF), Supercomputing Conference (SC'12), Salt Lake City, UT.

Software Releases

Testing and Verification Tools

Scalable Testing of Parallel Programs

Concurrent Programming is hard
- Bugs happen non‐deterministically
- Data races, deadlocks, atomicity violations, etc.

Goals: build a tool to test and debug concurrent and parallel programs
- Efficient: reduce overhead from 10x‐100x to 2x
- Precise
- Reproducible
- Scalable

Active random testing

Active Testing

Phase 1: Static or dynamic analysis to find potential concurrency bug patterns, such as data races, deadlocks, atomicity violations
- Data races: Eraser or lockset based [PLDI’08]
- Atomicity violations: cycle in transactions and happens‐before relation [FSE’08]
- Deadlocks: cycle in resource acquisition graph [PLDI’09]
- Publicly available tool for Java/Pthreads/UPC [CAV’09]
- Memory model bugs: cycle in happens‐before graph [ISSTA’11]
- For UPC programs running on thousands of cores [SC’11]

Phase 2: “Direct” testing (or model checking) based on the bug patterns obtained from phase 1
- Confirm bugs

Goals

Goal 1. Nice to have a trace exhibiting the data race
Goal 2. Nice to have a trace exhibiting the assertion failure
Goal 3. Nice to have a trace with fewer threads
Goal 4. Nice to have a trace with fewer context switches

Challenges for Exascale

Java and pthreads programs
- Synchronization with locks and condition variables
- Single node
Exascale has different programming models
- Large scale
- Bulk communication
- Collective operations with data movement
- Memory consistency
- Distributed shared memory
Cannot use centralized dynamic analyses
Cannot instrument and track every statement

Further Challenges

Targeted a simple programming paradigm
- Threads and shared memory
Similar techniques are available for MPI and CUDA
- ISP, DAMPI, MARMOT, Umpire, MessageChecker
- TASS uses symbolic execu7on
- PUG for CUDA
Analyze programs that mix different paradigms
- OpenMP, MPI, Shared Distributed Memory
- Need to correlate non‐determinism across paradigms

How Well Does it Scale?

Maximum 8% slowdown at 8K cores
- Franklin Cray XT4 Supercomputer at NERSC
- Quad‐core 2.x3GHz CPU and 8GB RAM per node
- Portals interconnect
Optimizations for scalability
- Efficient Data Structures
- Minimize Communication
- Sampling with Exponential Backoff

Debugging Tools

Debugging Project I

Detect bug with fewer threads and fewer context switches

Experience with C/PThreads: Over 90% of simplified traces were within 2 context switches of optimal

Small model hypothesis for parallel programs

Most bugs can be found with few threads
- 2‐3 threads
- No need to run on thousands of nodes
Most bugs can be found with fewer context switches [Musuvathi and Qadeer, PLDI 07]
- Helps in sequential debugging

Debugging Project II

Two‐level debugging of DSLs
Correlate program state across program versions

Debugging Project III

Find floating point anomalies
Recommend safe reduction of precision

Floating Point Debugging

Why do we care?

Usage of floating point programs has been growing rapidly
- HPC
- Cloud, games, graphics, finance, speech, signal processing

Most programmers are not expert in floating‐point!
- Why not use highest precision everywhere

High precision wastes
- Energy
- Time
- Storage

Floating Point Debugging Problem 1: Reduce unnecessary precision

Consider the problem of finding the arc length of the function

How can we find a minimal set of code fragments whose precision must be high?

Floating Point Debugging Problem 2: Detect Inaccuracy and Anomaly

What can we do?

We can reduce precision “safely”
- reduce power, improve performance, get better answer

Automated testing and debugging techniques
- To recommend “precision reduction”
- Formal proof of “safety” can be replaced by Concolic testing

Approach: automate previously hand-made debugging
- Concolic testing
- Delta debugging [Zeller et al.]

Implementation

Prototype implementation for C programs
- Uses CIL compiler framework
- http://perso.univ-perp.fr/guillaume.revy/index.php?page=debugging

Future plans
- Build on top of LLVM compiler framework

Potential Collaboration

Dynamic analyses to find bugs ‐ dynamic parallelism, unstructured parallelism, shared memory
- DEGAS, XPRESS, Traleika Glacier

Floating point debugging
- Co‐design centers

2‐level debugging
- D-TEC

Summary

Build testing tools
- Close to what programmers use
- Hide formal methods and program analysis under testing

If you are not obsessed with formal correctness
- Testing and debugging can help you solve these problems with high confidence