DSL's: Difference between revisions
From Modelado Foundation
Revision as of 02:38, May 7, 2014
Sonia requested that Saman Amarasinghe and Dan Quinlan initiate this page. For comments, please contact them. This page is still in development.
X-Stack Project | Name of the DSL | URL | Target domain | Miniapps supported | Front-end technology used | Internal representation used | Key optimizations performed | Code generation technology used | Processors and computing models targeted | Current status | Summary of the best results | Interface for performance and debugging tools | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D-TEC | Halide | http://halide-lang.org | Image processing algorithms | Cloverleaf, miniGMG, boxlib | Uses C++ | Custom IR | Stencil optimizations (fusion, blocking, parallelization, vectorization). Schedules can produce all levels of locality, parallelism, and redundant computation. OpenTuner for automatic schedule generation. | LLVM | x86 multicores, ARM, and GPUs | Working system. Used by Google and Adobe. | Local Laplacian filter: a top Adobe engineer took 3 months and 1500 lines of code to get 10x over the original. With Halide, 1 day and 60 lines gave code 20x faster, plus 90x faster GPU code the same day (Adobe did not even try GPUs). Also, all pictures taken by Google Glass are processed using a Halide pipeline. | ||
D-TEC | Shared Memory DSL | http://rosecompiler.org | MPI HPC applications on many-core nodes | Internal LLNL app | Uses C (maybe C++ and Fortran in future) | ROSE IR | Shared memory optimization for MPI processes on many-core architectures; permits sharing large data structures between processes to reduce memory requirements per core. | ROSE + any vendor compiler | Many-core architectures with local shared memory | Implementation released (4/28/2014) | Being evaluated for use | ||
D-TEC | Heterogeneous OpenMP | http://rosecompiler.org/ | HPC applications running on NVIDIA GPUs | boxlib, internal kernels | Uses C and C++ | ROSE IR (AST) | Loop collapse to expose more parallelism, hardware-aware thread/block configuration, data reuse to reduce data transfer, round-robin loop scheduling to reduce memory footprint | ROSE source-to-source + NVIDIA CUDA compiler | NVIDIA GPUs | Implementation released with ROSE (4/29/2014) | Matches or outperforms comparable compilers targeting GPUs. | ||
D-TEC | NUMA DSL | http://rosecompiler.org | HPC applications on many-core CPUs with NUMA support | Internal LLNL app | Uses C++ | ROSE IR | NUMA-aware data distribution to enhance data locality and avoid long memory latency. Multiple halo exchange schemes for stencil codes on structured grids. | ROSE + libnuma support | Many-core architectures with a NUMA hierarchy | Implementation in progress. | 1.7x performance improvement compared to OpenMP implementation for 2D 2nd-order stencil computation. | ||
D-TEC | OpenACC | https://github.com/tristanvdb/OpenACC-to-OpenCL-Compiler | Accelerated computing | Not yet. | C (possibly C++ and Fortran). Pragma parser for ROSE. | ROSE IR | Uses tiling to map parallel loops to OpenCL | ROSE (with OpenCL kernel generation backend), OpenCL C Compiler (LLVM) | Any accelerator with OpenCL support (CPUs, GPUs, Xeon Phi, ...) | Basic kernel generation; directive parsing; runtime tested on NVIDIA GPUs, Intel CPUs, and Intel Xeon Phi | Reaches ~50 Gflops on Tesla M2070 on matrix multiply. (M2070: ~1 Tflops peak, ~200 to ~400 Gflops effective on linear algebra; all floating point). | ||
DSL 6 | |||||||||||||
DSL 7 | |||||||||||||
DSL 8 |
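
Several rows above list blocking (tiling) among their stencil optimizations. As a rough illustration of what that transformation does, here is a minimal sketch in plain Python. It does not use Halide, ROSE, or any of the listed DSLs; the function names `stencil_naive` and `stencil_blocked`, the 5-point averaging stencil, and the tile size are all assumptions chosen for the example. The blocked version visits the same interior points as the naive one, but in small tiles so that neighbouring values are reused while they are still close in memory.

```python
def stencil_naive(grid):
    """Apply a 5-point averaging stencil to interior points; borders are copied."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


def stencil_blocked(grid, tile=4):
    """Same stencil, but the interior is traversed tile by tile (loop blocking)."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    # Outer loops step over tile origins; inner loops sweep one tile,
    # clamped so the last tile does not run past the interior.
    for ii in range(1, n - 1, tile):
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


if __name__ == "__main__":
    # Hypothetical 9x10 test grid; both traversal orders must agree.
    grid = [[float(i * 7 + j * 3) for j in range(10)] for i in range(9)]
    assert stencil_blocked(grid, tile=4) == stencil_naive(grid)
```

In a real DSL the tile size and loop order would be chosen by the schedule or the compiler rather than written by hand; the sketch only shows that blocking changes the traversal order, not the result.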