DSL's

From Modelado Foundation

Latest revision as of 00:07, May 23, 2014

Sonia requested that Saman Amarasinghe and Dan Quinlan initiate this page. For comments, please contact them. This page is still in development.

{|
! X-Stack Project
! Name of the DSL
! URL
! Target domain
! Miniapps supported
! Front-end technology used
! Internal representation used
! Key optimizations performed
! Code generation technology used
! Processors / computing models targeted
! Current status
! Summary of the best results
! Interface for performance and debugging tools
|- style="vertical-align:top;"
| D-TEC
| Halide
| http://halide-lang.org
| Image processing algorithms
| Cloverleaf, miniGMG, boxlib
| Uses C++
| Custom IR
| Stencil optimizations (fusion, blocking, parallelization, vectorization). Schedules can produce all levels of locality, parallelism and redundant computation. OpenTuner for automatic schedule generation.
| LLVM
| x86 multicores, ARM and GPU
| Working system. Used by Google and Adobe.
| Local Laplacian filter: a top Adobe engineer took 3 months and 1500 LOC to get 10x over the original. Halide: 1 day, 60 lines, 20x faster. In addition, 90x faster GPU code in the same day (Adobe did not even try GPUs). Also, all pictures taken by Google Glass are processed using a Halide pipeline.
| Interfaces with OpenTuner (http://opentuner.org) to automatically generate schedules. Working on a visualizing/debugging tool.
|- style="vertical-align:top;"
| D-TEC
| Shared Memory DSL
| http://rosecompiler.org
| MPI HPC applications on many-core nodes
| Internal LLNL app
| Uses C (maybe C++ and Fortran in future)
| ROSE IR
| Shared memory optimization for MPI processes on many-core architectures permits sharing large data structures between processes to reduce memory requirements per core.
| ROSE + any vendor compiler
| Many-core architectures with local shared memory
| Implementation released (4/28/2014)
| Being evaluated for use
|
|- style="vertical-align:top;"
| D-TEC
| X-GEN for heterogeneous computing
| http://rosecompiler.org/
| HPC applications running on NVIDIA GPUs
| boxlib, internal kernels
| Uses C and C++
| ROSE IR (AST)
| Loop collapse to expose more parallelism, hardware-aware thread/block configuration, data reuse to reduce data transfer, round-robin loop scheduling to reduce memory footprint
| ROSE source-to-source + NVIDIA CUDA compiler
| NVIDIA GPUs
| Implementation released with ROSE (4/29/2014)
| Matches or outperforms comparable compilers targeting GPUs.
| Generates event traces for gpuplot to identify serial bottlenecks
|- style="vertical-align:top;"
| D-TEC
| NUMA DSL
| http://rosecompiler.org
| HPC applications on many-core CPUs with NUMA support
| Internal LLNL app
| Uses C++
| ROSE IR
| NUMA-aware data distribution to enhance data locality and avoid long memory latency. Multiple halo exchange schemes for stencil codes using structured grids.
| ROSE + libnuma support
| Many-core architectures with a NUMA hierarchy
| Implementation in progress.
| 1.7x performance improvement compared to the OpenMP implementation for a 2D 2nd-order stencil computation.
| PAPI is used for performance profiling. libnuma and an internal debugging scheme are used to verify memory distribution among NUMA nodes.
|- style="vertical-align:top;"
| D-TEC
| OpenACC
| https://github.com/tristanvdb/OpenACC-to-OpenCL-Compiler
| Accelerated computing
| Not yet.
| C (possibly C++ and Fortran). Pragma parser for ROSE.
| ROSE IR
| Uses tiling to map parallel loops to OpenCL
| ROSE (with OpenCL kernel generation backend), OpenCL C compiler (LLVM)
| Any accelerator with OpenCL support (CPUs, GPUs, Xeon Phi, ...)
| Basic kernel generation; directives parsing; runtime tested on NVIDIA GPUs, Intel CPUs, and Intel Xeon Phi
| Reaches ~50 Gflops on a Tesla M2070 on matrix multiply. (M2070: ~1 Tflops peak, ~200 to ~400 Gflops effective on linear algebra; all floating point.)
| A profiling interface collects OpenCL profiling information in a database.
|- style="vertical-align:top;"
| D-TEC
| Rely
| http://groups.csail.mit.edu/pac/rely/
| Reliability-aware computing and approximate computing
| Internal kernels
| Subset of C with additional reliability annotations
| Custom IR
| A language and a static analysis framework for verifying the reliability of programs given function-level reliability specifications. Chisel, a code transformation tool built on top of Rely, automatically selects operations that can execute unreliably with minimum resource consumption, while satisfying the reliability specification.
| Generates C source code. Binary code generator implementation is in progress.
| -
| Implementation in progress
| Analysis of computational kernels from multimedia and scientific applications.
|
|- style="vertical-align:top;"
| D-TEC
| Simit
|
| Computations on domains expressible as a graph
| Internal physics simulations, Lulesh, MiniFE, phdMesh, MiniGhost
| Uses C++
| Custom IR
| Fusion, blocking, vectorization, parallelization, distribution, graph index sets
| LLVM
| x86 multicores, GPU and later distributed systems
| Design and implementation in progress
|
| Has a visual backend.
|- style="vertical-align:top;"
| TG X-Stack
| CnC
| https://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc
| Medical imaging, media software
| Lulesh, Rician Denoising, Registration, Segmentation
| Graph builder
| Graphs, tags, item collections, step collections
| Dependence graph generation for data flow computation
| CnC compiler
| Data flow computation
| v0.9
| Out-of-the-box speedup for most apps, automatically discovers parallelism
| TBB
|- style="vertical-align:top;"
| TG X-Stack
| HTA
| http://polaris.cs.uiuc.edu/hta/
| Scientific applications targeting Matlab
| Multigrid, AMR, LU, NAS parallel, SPIKE
| HTAlib
| Hierarchical tiled arrays
| Map-reduce operator framework, overlapped tiling, data layering
| C++ compilers
| Multicore, clusters
| 0.1
| Matched hand-coded MPI
| HPC Toolkit
|- style="vertical-align:top;"
| TG X-Stack
| HC
| https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C
| Medical imaging, oil and gas research
| Lulesh, Rician Denoising, Graph 500, UTS, SmithWaterman
| EDG
| Sage
| Continuable task generation to support finish
| ROSE
| Distributed data flow computation, structured parallel computation
| v0.5
| Performs better than OpenMP for most apps
| HPC Toolkit
|- style="vertical-align:top;"
| DEGAS
| Asp
| http://sejits.org
| Infrastructure for building embedded Python DSLs
|
| Python syntax
| Custom IR based on Python's AST
| Loop transformations, vectorization, template-based code generation, caching
| LLVM (in progress), C, C++, Scala, CUDA, OpenCL
| x86, NVIDIA GPUs, cloud, MPI
| Numerous DSLs including structured grids, recursive communication-avoiding matmult, machine learning algorithms, communication-avoiding solvers
| Structured grid DSL achieves 90%+ of peak for many kernels on multiple platforms
| In progress. Tech report on multi-tiered strategy.
|- style="vertical-align:top;"
| DSL 9
|
|
|
|
|
|
|
|
|
|
|
|
|}
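
Several entries above list blocking (tiling) among their stencil optimizations, and the NUMA DSL entry reports results for a 2D 2nd-order stencil. As a rough illustration only, the following Python sketch applies loop blocking to one sweep of a 5-point 2D stencil and checks that the blocked traversal gives the same result as the plain one. The function names, the averaging coefficients, and the tile size are assumptions made for this example; this is not code from Halide, ROSE, or any of the DSLs listed.

```python
def stencil_naive(grid):
    """One sweep of a 5-point stencil over the interior of a 2D grid.

    Illustrative only: each interior cell becomes the average of its
    four neighbours; boundary cells are copied unchanged.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def stencil_blocked(grid, tile=4):
    """Same sweep, but the interior is visited tile by tile.

    The two outer loops walk over tiles of size `tile` x `tile`
    (a hypothetical parameter); the two inner loops walk the cells
    inside one tile. Only the traversal order changes, not the result.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, tile):
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


if __name__ == "__main__":
    grid = [[float(i * j) for j in range(10)] for i in range(10)]
    # Blocking reorders the iteration space but must not change the values.
    assert stencil_blocked(grid, tile=3) == stencil_naive(grid)
    print("blocked and naive sweeps agree")
```

In a compiled language the tiled order can improve cache reuse because neighbouring cells are touched while they are still resident; in pure Python the sketch only demonstrates the loop structure, not a speedup.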