DSLs
From Modelado Foundation
{| class="wikitable"
! style="width: 200;" | X-Stack Project
! style="width: 200;" | Name of the DSL
! style="width: 200;" | URL
! style="width: 200;" | Target domain
! style="width: 200;" | Miniapps supported
! style="width: 200;" | Front-end technology used
! style="width: 200;" | Internal representation used
! style="width: 200;" | Key optimizations performed
! style="width: 200;" | Code generation technology used
! style="width: 200;" | Processor/computing models targeted
! style="width: 200;" | Current status
! style="width: 200;" | Summary of the best results
! style="width: 200;" | Interface for perf. & dbg. tools
|- style="vertical-align:top;"
|D-TEC
|Halide
|http://halide-lang.org
|Image processing algorithms
|CloverLeaf, miniGMG, BoxLib
|Uses C++
|Custom IR
|Stencil optimizations (fusion, blocking, parallelization, vectorization). Schedules can produce all levels of locality, parallelism, and redundant computation. OpenTuner is used for automatic schedule generation.
|LLVM
|x86 multicores, ARM, and GPUs
|Working system. Used by Google and Adobe.
|Local Laplacian filter: a top Adobe engineer took 3 months and 1,500 lines of code to get a 10x speedup over the original. In Halide, one day and 60 lines yielded a 20x speedup, plus 90x-faster GPU code the same day (Adobe did not even try GPUs). In addition, every picture taken by Google Glass is processed by a Halide pipeline.
|Interfaces with OpenTuner (http://opentuner.org) to automatically generate schedules. A visualization/debugging tool is in progress.
|- style="vertical-align:top;"
|D-TEC
|Shared Memory DSL
|http://rosecompiler.org
|MPI HPC applications on many-core nodes
|Internal LLNL app
|Uses C (possibly C++ and Fortran in the future)
|ROSE IR
|Shared-memory optimization for MPI processes on many-core architectures permits sharing large data structures between processes, reducing memory requirements per core.
|ROSE + any vendor compiler
|Many-core architectures with local shared memory
|Implementation released (4/28/2014)
|Being evaluated for use
|
|- style="vertical-align:top;"
|D-TEC
|X-GEN for heterogeneous computing
|http://rosecompiler.org/
|HPC applications running on NVIDIA GPUs
|BoxLib, internal kernels
|Uses C and C++
|ROSE IR (AST)
|Loop collapsing to expose more parallelism, hardware-aware thread/block configuration, data reuse to reduce data transfer, round-robin loop scheduling to reduce memory footprint
|ROSE source-to-source + NVIDIA CUDA compiler
|NVIDIA GPUs
|Implementation released with ROSE (4/29/2014)
|Matches or outperforms comparable compilers targeting GPUs.
|Generates event traces for gpuplot to identify serial bottlenecks
|- style="vertical-align:top;"
|D-TEC
|NUMA DSL
|http://rosecompiler.org
|HPC applications on NUMA many-core CPUs
|Internal LLNL app
|Uses C++
|ROSE IR
|NUMA-aware data distribution to enhance data locality and avoid long memory latencies. Multiple halo-exchange schemes for stencil codes on structured grids.
|ROSE + libnuma support
|Many-core architectures with a NUMA hierarchy
|Implementation in progress
|1.7x performance improvement over an OpenMP implementation for a 2D second-order stencil computation.
|PAPI is used for performance profiling. libnuma and an internal debugging scheme are used to verify memory distribution among NUMA nodes.
|- style="vertical-align:top;"
|D-TEC
|OpenACC
|https://github.com/tristanvdb/OpenACC-to-OpenCL-Compiler
|Accelerated computing
|Not yet
|C (possibly C++ and Fortran). Pragma parser for ROSE.
|ROSE IR
|Uses tiling to map parallel loops to OpenCL
|ROSE (with an OpenCL kernel generation backend), OpenCL C compiler (LLVM)
|Any accelerator with OpenCL support (CPUs, GPUs, Xeon Phi, ...)
|Basic kernel generation and directive parsing; runtime tested on NVIDIA GPUs, Intel CPUs, and Intel Xeon Phi
|Reaches ~50 GFLOPS on a Tesla M2070 for matrix multiply (M2070: ~1 TFLOPS peak, ~200 to ~400 GFLOPS effective on linear algebra; all floating point).
|A profiling interface collects OpenCL profiling information in a database.
|- style="vertical-align:top;"
|D-TEC
|Rely
|http://groups.csail.mit.edu/pac/rely/
|Reliability-aware computing and approximate computing
|Internal kernels
|Subset of C with additional reliability annotations
|Custom IR
|A language and static analysis framework for verifying the reliability of programs given function-level reliability specifications. Chisel, a code transformation tool built on top of Rely, automatically selects operations that can execute unreliably with minimal resource consumption while satisfying the reliability specification.
|Generates C source code. A binary code generator is in progress.
|-
|Implementation in progress
|Analysis of computational kernels from multimedia and scientific applications.
|
|- style="vertical-align:top;"
|D-TEC
|Simit
|
|Computations on domains expressible as a graph
|Internal physics simulations, LULESH, MiniFE, phdMesh, MiniGhost
|Uses C++
|Custom IR
|Fusion, blocking, vectorization, parallelization, distribution, graph index sets
|LLVM
|x86 multicores, GPUs, and later distributed systems
|Design and implementation in progress
|
|Has a visual backend.
|- style="vertical-align:top;"
|TG X-Stack
|CnC
|https://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc
|Medical imaging, media software
|LULESH, Rician denoising, registration, segmentation
|Graph builder
|Graphs, tags, item collections, step collections
|Dependence graph generation for dataflow computation
|CnC compiler
|Dataflow computation
|v0.9
|Out-of-the-box speedup for most apps; automatically discovers parallelism
|TBB
|- style="vertical-align:top;"
|TG X-Stack
|HTA
|http://polaris.cs.uiuc.edu/hta/
|Scientific applications targeting MATLAB
|Multigrid, AMR, LU, NAS Parallel Benchmarks, SPIKE
|HTAlib
|Hierarchical tiled arrays
|Map-reduce operator framework, overlapped tiling, data layering
|C++ compilers
|Multicore, clusters
|v0.1
|Matched hand-coded MPI
|HPC Toolkit
|- style="vertical-align:top;"
|TG X-Stack
|HC
|https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C
|Medical imaging, oil and gas research
|LULESH, Rician denoising, Graph 500, UTS, Smith-Waterman
|EDG
|Sage
|Continuable task generation to support finish
|ROSE
|Distributed dataflow computation, structured parallel computation
|v0.5
|Performs better than OpenMP for most apps
|HPC Toolkit
|- style="vertical-align:top;"
|DEGAS
|Asp
|http://sejits.org
|Infrastructure for building embedded Python DSLs
|
|Python syntax
|Custom IR based on Python's AST
|Loop transformations, vectorization, template-based code generation, caching
|LLVM (in progress), C, C++, Scala, CUDA, OpenCL
|x86, NVIDIA GPUs, cloud, MPI
|Numerous DSLs built, including structured grids, recursive communication-avoiding matrix multiply, machine learning algorithms, and communication-avoiding solvers
|Structured-grid DSL achieves 90%+ of peak for many kernels on multiple platforms
|In progress; tech report on multi-tiered strategy.
|- style="vertical-align:top;"
|DSL 9
|
|
|
|
|
|
|
|
|
|
|
|
|}
Latest revision as of 00:07, May 23, 2014
Sonia requested that Saman Amarasinghe and Dan Quinlan initiate this page. For comments, please contact them. This page is still in development.