### Vancouver: Designing a Next-Generation Software Infrastructure for Productive Heterogeneous Exascale Computing

Jeffrey S. Vetter (ORNL), PI; Allen Malony (Oregon), Co-PI; Wen-Mei Hwu (UIUC), Co-PI; Rich Vuduc (GT), Co-PI

Seyong Lee (ORNL), Jungwon Kim (ORNL), Joel Denny (ORNL), Kittisak Sajjapongse (ORNL), Sameer Shende (Oregon), Nicholas Chaimov (Oregon), Robert Lim (Oregon), Kevin Huck (Oregon), John Larson (UIUC), Carl Pearson (UIUC), Liwen Chang (UIUC)

http://ft.ornl.gov/trac/vancouver

ORNL is managed by UT-Battelle for the US Department of Energy

http://ft.ornl.gov vetter@computer.org

**OAK RIDGE** National Laboratory

# **ASCR Computing At a Glance**

| System attributes | NERSC Now | OLCF Now | ALCF Now | NERSC Upgrade | OLCF Upgrade | ALCF Upgrade (Theta) | ALCF Upgrade (Aurora) |
|---|---|---|---|---|---|---|---|
| Name / planned installation | Edison | Titan | Mira | Cori, 2016 | Summit, 2017-2018 | Theta, 2016 | Aurora, 2018-2019 |
| System peak (PF) | 2.6 | 27 | 10 | > 30 | 150 | > 8.5 | 180 |
| Peak power (MW) | 2 | 9 | 4.8 | < 3.7 | 10 | 1.7 | 13 |
| Total system memory | 357 TB | 710 TB | 768 TB | ~1 PB DDR4 + High Bandwidth Memory (HBM) + 1.5 PB persistent memory | > 1.74 PB DDR4 + HBM + 2.8 PB persistent memory | > 480 TB DDR4 + High Bandwidth Memory (HBM) | > 7 PB High Bandwidth On-Package Memory, Local Memory, and Persistent Memory |
| Node performance (TF) | 0.460 | 1.452 | 0.204 | > 3 | > 40 | > 3 | > 17 times Mira |
| Node processors | Intel Ivy Bridge | AMD Opteron, Nvidia Kepler | 64-bit PowerPC A2 | Intel Knights Landing many-core CPUs; Intel Haswell CPUs in data partition | Multiple IBM Power9 CPUs and multiple Nvidia Volta GPUs | Intel Knights Landing Xeon Phi many-core CPUs | Knights Hill Xeon Phi many-core CPUs |
| System size (nodes) | 5,600 | 18,688 | 49,152 | 9,300 nodes; 1,900 nodes in data partition | ~3,500 | > 2,500 | > 50,000 |
| System interconnect | Aries | Gemini | 5D Torus | Aries | Dual Rail EDR-IB | Aries | 2nd Generation Intel Omni-Path Architecture |
| File system | 7.6 PB, 168 GB/s, Lustre® | 32 PB, 1 TB/s, Lustre® | 26 PB, 300 GB/s, GPFS™ | 28 PB, 744 GB/s, Lustre® | 120 PB, 1 TB/s, GPFS™ | 10 PB, 210 GB/s, Lustre (initial) | 150 PB, 1 TB/s, Lustre® |

# **Recent SHC Announcements**



### Vancouver: Designing a Next-Generation Software Infrastructure for Productive Heterogeneous Exascale Computing

- Funded in X-Stack call in 2010: LAB 10-257
  - Renewed in 2013
- Focus on Scalable Heterogeneous Computing (SHC)
  - In 2010, no SHC production systems existed in DOE ASCR
  - SHC systems offered high performance, energy efficiency, and density
  - However, they also posed challenges:
    - low productivity,
    - poor portability,
    - lack of tools and libraries,
    - performance sensitivities

- Performance tools and benchmarks (SHOC, TAU, Parboil)
  - Benchmarking, Code analysis, inspection, transformation
- Create the next generation of tools to develop and understand SHC applications
  - Design high-level abstractions (OpenARC, MxPA, Tangram)
- Directive-based SHC programming models with multilevel parallelism
- Implement libraries, tools, and runtime systems (autotuning SuperLU)
- Focus on DOE SHC applications and systems (many)
- Tutorials and Hackathons at conferences, various sites






# **Recent Highlights**

- Languages
  - OpenARC performance portability studies across NVIDIA, AMD, Xeon Phi, and Altera FPGA
  - Intelligent compiler selection of optimizations
- Benchmarks
  - Improvements to SHOC suite: OpenACC, more kernels
- Performance Tools
  - Performance measurement and analysis support for latest manycore accelerators (NVIDIA GPUs) and coprocessors (Intel KNC (KNL)) fully integrated in TAU Performance System
  - GPU performance prediction techniques based on static code analysis (instruction mix, control flow graph, occupancy) and dynamic analysis (instruction counts, branch frequency, memory reuse), and use of these techniques with autotuning frameworks
- Algorithms and autotuning
  - First hybrid, distributed-memory CPU+co-processor (GPU + Phi) sparse direct solver, with ongoing integration into SuperLU\_DIST jointly with Sherry Li at LBNL
- Code Synthesis
  - High-performance, energy efficient, and portable code synthesis with Tangram
  - OpenCL performance portability for RSBench between CPUs and GPUs with MxPA



### Programming Models for Heterogeneous Systems



http://ft.ornl.gov/research/openarc

### **OpenARC: Open Accelerator Research Compiler**

- Problem
  - Directive-based accelerator programming models provide abstraction over architectural details and low-level
    programming complexities. However, too much abstraction puts significant burdens on performance tuning,
    debugging, and scaling.
- Solution
  - OpenARC is an open-source, extensible compiler framework built on a very high-level intermediate representation (HIR). Performance optimizations, traceability mechanisms, fault-tolerance techniques, and similar features can be built on it to improve debuggability, performance, and resilience in complex accelerator computing.
- Tech Transfer
  - Deployed to Argonne National Laboratory, Los Alamos National Laboratory, IBM France, IBM US, Barcelona Supercomputing Center, University of La Laguna, Tokyo Institute of Technology, Auburn University, Duke University, Purdue University, etc.
  - Participate in OpenACC technical committee
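
To make the abstraction under "Problem" concrete, here is a minimal sketch of directive-based offloading in OpenACC C. The function name `saxpy` and its parameters are illustrative and do not come from OpenARC. A compiler without OpenACC support ignores the pragma and runs the loop serially on the host, which is part of the appeal of the directive approach.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y = alpha * x + y.
 * The directive asks an OpenACC compiler to offload the loop; the data
 * clauses state what is copied to the device (x) and what is copied in
 * both directions (y). No device-specific code appears in the source. */
void saxpy(size_t n, float alpha, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

The tuning burden mentioned above shows up here too: the single directive says nothing about gang or vector sizes, so performance depends on what the compiler chooses for each target.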



S. Lee and J.S. Vetter, "OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing," in ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC). Vancouver: ACM, 2014

### Understanding Performance Portability of High-level Programming Models for Heterogeneous Systems

#### Problem

Directive-based, high-level accelerator programming models such as OpenACC provide code portability. But how do they fare on performance portability? Which architectural features and compiler optimizations affect performance portability, and by how much?

#### Solution

- Proposed a high-level, architecture-independent intermediate language (HeteroIR) to map high-level programming models (e.g., OpenACC) to diverse heterogeneous devices while maintaining portability.
- Using HeteroIR, ported various OpenACC applications to diverse architectures and measured their performance portability.

#### Results


- Using HeteroIR, OpenARC ported 12 OpenACC applications to diverse architectures (NVIDIA CUDA, AMD GCN, and Intel MIC), and measured the performance portability achieved across all applications.
- HeteroIR abstracts out the common architecture functionalities, which makes it easy for OpenARC (and other compilers) to support diverse heterogeneous architectures.
- HeteroIR, combined with rich OpenARC directives and built-in tuning tools, allows OpenARC to be used for various tuning studies on diverse architectures.



Amit Sabne, Putt Sakdhnagool, Seyong Lee, and Jeffrey S. Vetter. Understanding Portability of a High-level Programming Model on Contemporary Heterogeneous Architectures, IEEE Micro Volume 35, Issue 4 (DOI: 10.1109/MM.2015.73), 2015.

### **Overall Performance Portability**




# Intelligent selection of optimizations based on target architecture



Figure 5: Memory Coalescing Benefits on Different Architectures : MIC is impacted the least by the non-coalesced accesses



Figure 7: Impact of Tiling Transformation : *MATMUL* shows higher benefits than *JACOBI* owing to more contiguous accesses



Figure 9: Effects of Loop Unrolling - MIC shows benefits on unrolling



Fig. 11: Comparison of hand-written CUDA/OpenCL programs against auto-tuned OpenARC code versions : Tuned OpenACC programs perform reasonably well against hand-written codes
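
The tiling transformation named in Figure 7 can be sketched in plain C as a blocked matrix multiply. The tile size `TILE` and the function name are assumptions chosen for illustration; in OpenARC such transformations are applied by the compiler rather than written by hand.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* assumed tile edge; an autotuner would search over this */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C = A * B for n x n row-major matrices, computed tile by tile so that
 * each block of A and B is reused while it is still resident in cache
 * (or in on-chip memory on an accelerator). Handles n not divisible
 * by TILE by clamping the inner loop bounds. */
void matmul_tiled(size_t n, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n * n; i++) C[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < min_sz(ii + TILE, n); i++)
                    for (size_t k = kk; k < min_sz(kk + TILE, n); k++) {
                        float a = A[i * n + k];
                        /* innermost loop walks a contiguous row segment */
                        for (size_t j = jj; j < min_sz(jj + TILE, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The innermost loop touches contiguous elements of `B` and `C`, which is the access pattern the Figure 7 caption credits for the larger benefit on *MATMUL*.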


### **OpenACC to FPGA: A Framework for Directive-Based High-Performance Reconfigurable Computing**

- Problem
  - Reconfigurable computers, such as FPGAs, offer more performance and energy efficiency for specific workloads than other heterogeneous systems, but their programming complexities and low portability have limited their deployment in large scale HPC systems.
- Solution
  - Proposed an OpenACC-to-FPGA translation framework, which performs source-to-source translation of the input OpenACC program into an output OpenCL code, which is further compiled to an FPGA program by the underlying backend Altera OpenCL compiler.
- Recent Results
  - Proposed several FPGA-specific OpenACC compiler optimizations and pragma extensions to achieve higher throughput.
  - Evaluated the framework using eight OpenACC benchmarks, and measured performance variations on diverse architectures (Altera FPGA, NVIDIA/AMD GPUs, and Intel Xeon Phi).



- Impact
  - Proposed translation framework is the first work to use a standard and portable, directive-based, high-level programming system for FPGAs.
  - Preliminary evaluation of eight OpenACC benchmarks on an FPGA and comparison study on other accelerators identified that the unique capabilities of an FPGA offer new performance tuning opportunities different from other accelerators.

Seyong Lee, Jungwon Kim, and Jeffrey S. Vetter. OpenACC to FPGA: A Framework for Directive-based High-Performance Reconfigurable Computing, IPDPS, 2016. <u>http://ft.ornl.gov/research/openarc</u>


# **Optimizations for Reconfigurable Computing**



Figure 2: FPGA OpenCL Architecture



(a) Global Memory Access Without Channels (b) Global Memory Access With Channels

Figure 3: Difference in Global Memory Access Pattern as a Result of Channels Implementation

#### Listing 4: Altera OpenCL (AOCL) Channel Example

```
/* (a) Input OpenACC code */
#pragma acc data copyout(a[0:N]) create(b[0:N]) \
    copyin(c[0:N])
{
    #pragma acc kernels loop gang worker present(b, c)
    for (i=0; i<N; i++) { b[i] = c[i]*c[i]; }
    #pragma acc kernels loop gang worker present(a, b)
    for (i=0; i<N; i++) { a[i] = b[i]; }
}

/* (b) Modified OpenACC code for kernel-pipelining:
 *     b is declared as a pipe instead of a device buffer */
#pragma acc data copyout(a[0:N]) pipe(b[0:N]) \
    copyin(c[0:N])
{
    #pragma acc kernels loop gang worker pipeout(b) present(c)
    for (i=0; i<N; i++) { b[i] = c[i]*c[i]; }
    #pragma acc kernels loop gang worker pipein(b) present(a)
    for (i=0; i<N; i++) { a[i] = b[i]; }
}

/* (c) Output OpenCL code with channels:
 *     the producer writes to pipe_b, the consumer reads from it,
 *     so b never goes through global memory */
#pragma OPENCL EXTENSION cl_altera_channels : enable
channel float pipe_b;
__kernel void kernel0(__global float * c)
{
    int i = get_global_id(0);
    write_channel_altera(pipe_b, (c[i]*c[i]));
}
__kernel void kernel1(__global float * a)
{
    int i = get_global_id(0);
    a[i] = read_channel_altera(pipe_b);
}
```



### Benchmarks for Understanding Architectures and Programming Models



#### The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite

PI: Jeffrey S. Vetter, ORNL Future Technologies Group

https://github.com/vetter/shoc/wiki

#### **Objectives**

- Design and implement a set of performance and stability tests for HPC systems with heterogeneous architectures
- Implement each test in OpenCL, CUDA, OpenACC, OpenMP, and with MPI, to:
  - Evaluate the differences in these emerging programming models
  - Across a diverse set of architectures (NVIDIA, AMD, ARM, Xeon Phi)
- Increase understanding of how important applications will map to emerging architectures
- Open Source for easy use, porting, contributions

#### Accomplishments

- Consistent open source software releases
- Overview published at the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10)
- Updated 2.0 version adds OpenACC and Intel Xeon Phi support via OpenMP and offload directives
- New architecture and programming model analysis to be published at 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15) at SC15

#### Xeon Phi vs Kepler K20c (GFLOPS or GB/s)



#### Impact and Champions

- Over 10,000 downloads internationally since 2010
- Approximately 275 citations
- Used by vendors, researchers, and for procurements
- Provide a standardized test suite for architecture evaluations, procurements, and acceptance tests
- Researchers and engineers from several computing device vendors are engaged and providing contributions

M.G. Lopez, J. Young, J.S. Meredith, P.C. Roth, M.Horton and J.S. Vetter, "Examining Recent Many-core Architectures and Programming Models Using SHOC", in 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15), Austin, 2015.

A. Danalis, G. Marin, C. McCurdy, J. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, and J.S. Vetter, "The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite," in Third Workshop on General-Purpose Computation on Graphics Processors (GPGPU 2010). Pittsburgh, 2010.

### **Performance Tools**



### **Performance Tools: TAU**

- New GPU sampling capabilities in TAU were applied to LAMMPS and LULESH to characterize their behavior during execution across multiple GPU architectures.
- New control flow graph (CFG) analysis of GPU kernels was integrated in TAU and allowed more detailed characterization and autotuning of Gaussian across GPU architectures.
- Optimization of the OpenMC framework on Xeon Phi clusters (TACC Stampede, LBNL Babbage) with TAU resulted in the highest known single-node calculation rate for a Monte Carlo neutron transport code using a standard benchmark (17,000 particles/second, 95% distributed efficiency with 512 concurrent MIC devices).



### TAU Sampling Enables Insights in GPU Performance

- Problem
  - Hard to correlate performance spikes with source code
  - Event queue method: inject event at beginning and end of execution (no idea what happens in between)
- Solution
  - Sample GPU program counter during runtime
  - Map PC samples to disassembled instructions
  - Calculate metrics intensity at kernel level
    - Paper: Identifying Optimization Opportunities within Kernel Execution in GPU Codes (HeteroPar 2015)
- Recent results
  - Three GPUs: M2090 (Fermi), K80 (Kepler), M6000 (Maxwell)
  - LAMMPS: PK kernel mostly compute operations
  - LULESH: CKE, CMG, CE2 more compute-intensive, also branch- and move-intensive
- Impact
  - Understand kernel behavior in real time
  - Identify where in the code to spend tuning effort
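
The "Solution" steps above (sample the PC, map samples to instructions, compute intensity per kernel) can be sketched as follows. This is a minimal illustration under stated assumptions: the instruction classes, the `struct instr` layout, and the function `sample_intensity` are hypothetical, and how TAU actually obtains samples and disassembly is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classes; a real tool would derive these from
 * the disassembled kernel. */
enum op_class { OP_COMPUTE, OP_MEMORY, OP_BRANCH, OP_MOVE, OP_NUM_CLASSES };

/* One disassembled instruction: its program-counter offset and class. */
struct instr { unsigned pc; enum op_class cls; };

/* Attribute each sampled PC to the instruction at that offset and store,
 * in frac[], the share of matched samples that fell in each class.
 * Samples with no matching instruction are skipped.
 * Returns the number of matched samples. */
size_t sample_intensity(const unsigned *samples, size_t nsamples,
                        const struct instr *code, size_t ninstr,
                        double frac[OP_NUM_CLASSES])
{
    size_t counts[OP_NUM_CLASSES] = {0};
    size_t matched = 0;
    assert(samples != NULL && code != NULL && frac != NULL);

    for (size_t s = 0; s < nsamples; s++) {
        for (size_t i = 0; i < ninstr; i++) {  /* linear lookup for brevity */
            if (code[i].pc == samples[s]) {
                counts[code[i].cls]++;
                matched++;
                break;
            }
        }
    }
    for (int c = 0; c < OP_NUM_CLASSES; c++)
        frac[c] = matched ? (double)counts[c] / (double)matched : 0.0;
    return matched;
}
```

A kernel whose samples land mostly on `OP_COMPUTE` instructions would be characterized as compute-intensive in the sense used for the LAMMPS and LULESH results above.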







# Modeling Control Flow Behavior in GPU Architectures



- Problem
  - GPU architectures execute SIMD lanes in lock step, masking threads that do not satisfy a branch condition
  - Threads that take a branch execute while the others wait and then resynchronize, which costs performance
- Solution
  - Control flow graphs (CFG) used to represent divergent branches
  - Derive execution frequencies to determine how an application with input size N will perform, without compiling or running the application
- Recent results
  - Three GPUs: M2090 (Fermi), K80 (Kepler), M6000 (Maxwell)
    - Each GPU creates its own CFG
    - Higher trip counts seen for Gaussian on Fermi
  - Autotuned four kernel computations using Orio on three GPU architectures
- Impact
  - Predict optimal performance parameters for applications on a given architecture (e.g. thread counts, block sizes), based on PC Sampling data and basic block counts
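
The lock-step masking described under "Problem" can be modeled in a few lines of C. This is a simplified sketch, not the project's CFG model: the function `warp_branch_cost` and its abstract cost units are hypothetical, and it ignores nested branches and reconvergence overhead.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32  /* lanes that execute in lock step on NVIDIA GPUs */

/* Issue slots a warp spends on one if/else when lanes are masked rather
 * than branched independently: the warp executes the then-path if any
 * lane takes it, and the else-path if any lane does not. */
unsigned warp_branch_cost(const int taken[WARP_SIZE],
                          unsigned then_cost, unsigned else_cost)
{
    int any_taken = 0, any_not_taken = 0;
    assert(taken != NULL);
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (taken[lane]) any_taken = 1;
        else             any_not_taken = 1;
    }
    return (any_taken ? then_cost : 0u) + (any_not_taken ? else_cost : 0u);
}
```

When all lanes agree, the warp pays for one path only; as soon as a single lane diverges, it pays for both. Weighting such per-branch costs by execution frequencies from a CFG is one way to estimate kernel behavior without running it.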





Percentage of overall instruction operations executed for ATAX, BiCG, ex14FJ and MatVec2D kernels for five input sizes, comparing various architecture generations.



Block and Thread Settings - ex14FJ





Block and thread sizes for matVec2D kernel, comparing various architectures.



Overview of CUDAflow methodology in discovering application settings.



Control flow graphs generated for each CUDA kernel, comparing architecture families (Fermi, Kepler, Maxwell).



### Model Driven Autotuning Libraries: SuperLU Dist



# **Model Driven Autotuning Libraries**

- Focus on enabling technologies for libraries, especially for sparse and irregular computations on heterogeneous platforms
  - Computations we have accelerated under Vancouver 1 & 2: Fast transforms, N-body tree codes, and sparse matrix computations
  - Technologies we have developed: New data structures, model-driven autotuning
  - Platforms: *All* work targets distributed memory heterogeneous systems
- Highlight (from Vancouver 2): The *first* hybrid, distributed memory CPU+GPU sparse direct solver (sparse Gaussian elimination)
  - Collaboration with Sherry Li at LBNL (SciDAC-funded)
  - GPU work is integrated into the open-source library, SuperLU\_DIST
  - A unified framework targeting GPU and Xeon Phi is in progress



# SuperLU\_DIST (sparse Ax=b solver)



We analyzed complex dependences and data structures in the prior version of SuperLU.

We designed a model-driven autotuning framework for scheduling and adapting SuperLU to use co-processors (GPU + Xeon Phi).

# SuperLU\_DIST (sparse Ax=b solver)



Our results indicate good strong scaling (above), weak scaling (not shown) and absolute performance (fraction of peak) when running on accelerated systems.



# **Code Synthesis Tools**



### **Tangram – Code Synthesis for Performance Portability**

- Problem
  - High-performance, energy-efficient code is too costly to develop and maintain
- Solution
  - Idea: large, complex high-performance codes are often built from small building blocks and a small set of decomposition/composition schemes
  - Complexity comes from multi-level composition/decomposition and optimizations
  - Users provide building blocks and rules (easy, reusable), and Tangram automates the multi-level composition/decomposition/optimization (complex, specific to each hardware type).
- Recent results
  - Synthesized code from simple portable sources matches or beats custom, hand-tuned C-level code across GPUs and CPUs
- Impact
  - Drastically reduced development and porting cost for high-performance, energy-efficient code at the exascale
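
The idea of building blocks plus composition rules can be illustrated with a reduction written in plain C. This is not Tangram syntax or output: the function names and the tile sizes 64 and 1024 are assumptions, and the composition that Tangram would select automatically is written out by hand here.

```c
#include <assert.h>
#include <stddef.h>

/* Building block: sequential sum of a contiguous range. */
static float sum_serial(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += x[i];
    return acc;
}

typedef float (*reducer)(const float *, size_t);

/* Composition rule: split the input into tiles of `tile` elements,
 * reduce each tile with `leaf`, then combine the partial results. */
static float sum_tiled(const float *x, size_t n, size_t tile, reducer leaf)
{
    float acc = 0.0f;
    assert(tile > 0);
    for (size_t start = 0; start < n; start += tile) {
        size_t len = (n - start < tile) ? n - start : tile;
        acc += leaf(x + start, len);
    }
    return acc;
}

/* Inner level: 64-element tiles reduced serially (e.g., one per thread). */
static float inner_tiled_64(const float *x, size_t n)
{
    return sum_tiled(x, n, 64, sum_serial);
}

/* Two-level composition: 1024-element outer tiles (e.g., one per thread
 * block), each reduced by the inner level. */
float sum_two_level(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    return sum_tiled(x, n, 1024, inner_tiled_64);
}
```

The same two pieces, one leaf and one partition rule, yield different code shapes depending on how many levels are composed and with what tile sizes, which is the search space a synthesis tool can explore per hardware target.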



### Vancouver II: RSbench results



Figure 1: OpenCL/MxPA performance versus original OpenMP code for RSbench. Hardware: [CPU: 4 Intel Xeon E5-2680V2, GPU: 1 Nvidia K40m]. The OpenCL implementation was only 45% faster than the original OpenMP implementation because of the overhead of transferring data to and from the GPU. The OpenCL code retargeted by MxPA achieves about 8x speedup over the original. OpenCL explicitly exposes more of the available parallelism than the OpenMP code, and MxPA takes advantage of this when scheduling work onto the CPU cores.

- Available OpenCL GPU code retargeted for CPU execution through MxPA
- MxPA successfully provides portability from a single source language
- Outperforms the original OpenMP version when targeting both the GPU and CPU.
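
A common way to run an OpenCL-style kernel on a CPU is to turn work-groups and work-items into nested loops. The sketch below shows that general technique only; the kernel, the function names, and the loop structure are assumptions and do not describe MxPA's actual code generation.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body as it would appear in OpenCL, with the work-item index
 * passed explicitly instead of read from get_global_id(0). */
static void square_kernel(size_t gid, const float *in, float *out)
{
    out[gid] = in[gid] * in[gid];
}

/* CPU launch: work-groups become outer-loop iterations (candidates for
 * one CPU thread each) and work-items become a tight inner loop that a
 * compiler can vectorize. */
void launch_on_cpu(size_t global_size, size_t local_size,
                   const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; group++) {
        for (size_t lid = 0; lid < local_size; lid++) {
            square_kernel(group * local_size + lid, in, out);
        }
    }
}
```

Because every work-item is independent by construction, the outer loop can be distributed across cores without further analysis, which is consistent with the observation in the caption that OpenCL exposes more parallelism than the OpenMP source.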







# **Tech Transfer and Some Futures**

### Tech transfer approaches

- Open source
- Deployment
- Working directly with users
- Vendor deployment
- Impact specifications like OpenACC

### Futures (same areas)

- Memory hierarchies
  - Unified memory
  - Paging
  - Software managed cache
- New heterogeneous architectures and levels of integration
- Performance prediction of architectural alternatives at runtime
- Preemption
- New performance analysis features



# Acknowledgements



- Contributors and Sponsors
  - Future Technologies Group: <u>http://ft.ornl.gov</u>
  - US Department of Energy Office of Science
    - DOE Vancouver Project: <u>https://ft.ornl.gov/trac/vancouver</u>
    - DOE Blackcomb Project: <u>https://ft.ornl.gov/trac/blackcomb</u>
    - DOE ExMatEx Codesign Center: <u>http://codesign.lanl.gov</u>
    - DOE Cesar Codesign Center: <u>http://cesar.mcs.anl.gov/</u>
    - DOE Exascale Efforts: <u>http://science.energy.gov/ascr/research/computer-science/</u>
  - Scalable Heterogeneous Computing Benchmark team: <u>http://bit.ly/shocmarx</u>
  - US National Science Foundation Keeneland Project: <u>http://keeneland.gatech.edu</u>
  - US DARPA
  - NVIDIA CUDA Center of Excellence









# Software and tech transfer strategies

- Many ways our project distributes software and technology
  - Publicly available open source (OpenARC, SHOC, etc)
  - Deployed directly on DOE systems (TAU)
  - Deployed through vendor
    - NVIDIA cuSPARSE Tridiagonal Solver by UIUC
    - UIUC MxPA distributed by MulticoreWare
  - Influence broad community software (LLVM)
- Additional strategy
  - Develop research prototype reference implementation
  - Pursue interesting, relevant challenges
  - Use results and experiences to influence standards
    - OpenACC, OpenMP, SPEC, etc
  - Use pull of procurements to get new standards implemented

# **TAU Performance System®**

- Performance problem solving framework for HPC
  - Integrated, scalable, flexible, portable
  - Target all parallel programming / execution paradigms





- Integrated performance toolkit (open source)
  - Multi-level performance instrumentation
  - Flexible and configurable performance measurement
  - Widely-ported performance profiling / tracing system
  - Performance data management and data mining



# **TAU Deployment Status**

- Initial port of TAU to IBM Power 8 Linux with GPUs
- OpenACC support with PGI 15.x compilers
- BFD (binutils) for address translation
- libunwind for call-stack unwinding
- OpenMP Tools interface based on LLVM runtime
- Support for tracking memory footprint
- Support for hardware performance counters to measure vectorization
- Support for tracking NUMA effects (remote to total DRAM accesses)
- Support for ARM64 systems added, port to ARM64 with GPUs is underway
- TAU v2.25 released with BSD style license in November 2015
- http://tau.uoregon.edu

