DSL's: Difference between revisions

Latest revision as of 00:07, May 23, 2014

Sonia requested that Saman Amarasinghe and Dan Quinlan initiate this page. For comments, please contact them. This page is still in development.

X-Stack Project	Name of the DSL	URL	Target domain	Miniapps supported	Front-end technology used	Internal representation used	Key Optimizations performed	Code generation technology used	Processors computing models targeted	Current status	Summary of the best results	Interface for perf.&dbg. tools
D-TEC	Halide	http://halide-lang.org	Image processing algorithms	Cloverleaf, miniGMG, boxlib	Uses C++	Custom IR	Stencil optimizations (fusion, blocking, parallelization, vectorization). Schedules can produce all levels of locality, parallelism and redundant computation. OpenTuner for automatic schedule generation.	LLVM	X86 multicores, ARM and GPU	Working system. Used by Google and Adobe.	Local Laplacian filter: a top Adobe engineer took 3 months and 1500 LOC to get 10x over the original; the Halide version took 1 day and 60 lines and ran 20x faster. In addition, 90x faster GPU code was written the same day (Adobe did not even try GPUs). Also, all pictures taken by Google Glass are processed using a Halide pipeline.	Interfaces with OpenTuner (http://opentuner.org) to automatically generate schedules. Working on a visualizing/debugging tool.
D-TEC	Shared Memory DSL	http://rosecompiler.org	MPI HPC applications on many-core nodes	Internal LLNL App	Uses C (maybe C++ and Fortran in future)	ROSE IR	Shared memory optimization for MPI processes on many-core architectures permits sharing large data structures between processes to reduce memory requirements per core.	ROSE + any vendor compiler	Many-core architectures with local shared memory	Implementation released (4/28/2014)	Being evaluated for use
D-TEC	X-GEN for heterogeneous computing	http://rosecompiler.org/	HPC applications running on NVIDIA GPUs	boxlib, internal kernels	Uses C and C++	ROSE IR (AST)	loop collapse to expose more parallelism, Hardware-aware thread/block configuration, data reuse to reduce data transfer, round-robin loop scheduling to reduce memory footprint	ROSE source-to-source + NVIDIA CUDA compiler	NVIDIA GPUs	Implementation released with ROSE (4/29/2014)	Matches or outperforms comparable compilers targeting GPUs.	Generates event traces for gpuplot to identify serial bottlenecks
D-TEC	NUMA DSL	http://rosecompiler.org	HPC applications on NUMA-capable many-core CPUs	internal LLNL App	Uses C++	ROSE IR	NUMA-aware data distribution to enhance data locality and avoid long memory latency. Multiple halo-exchange schemes for stencil codes on structured grids.	ROSE + libnuma support	Many-core architecture with NUMA hierarchy	implementation in progress.	1.7x performance improvement compared to OpenMP implementation for 2D 2nd order stencil computation.	PAPI is used for performance profiling. libnuma and an internal debugging scheme are used to verify memory distribution among NUMA nodes.
D-TEC	OpenACC	https://github.com/tristanvdb/OpenACC-to-OpenCL-Compiler	Accelerated computing	Not yet.	C (possibly C++ and Fortran). Pragma parser for ROSE.	ROSE IR	Relies on tiling to map parallel loops to OpenCL	ROSE (with OpenCL kernel generation backend), OpenCL C Compiler (LLVM)	Any accelerator with OpenCL support (CPUs, GPUs, XeonPhi, ...)	- Basic kernel generation - Directives parsing - Runtime tested on NVIDIA GPUs, Intel CPUs, and Intel XeonPhi	Reaches ~50 Gflops on Tesla M2070 on matrix multiply. (M2070: ~1 Tflops peak, ~200 to ~400 Gflops effective on linear algebra; all floating point).	A profiling interface collects OpenCL profiling information in a database.
D-TEC	Rely	http://groups.csail.mit.edu/pac/rely/	Reliability-aware computing and Approximate computing	Internal kernels	Subset of C with additional reliability annotations	Custom IR	A language and a static analysis framework for verifying reliability of programs given function-level reliability specifications. Chisel, a code transformation tool built on top of Rely, automatically selects operations that can execute unreliably with minimum resource consumption, while satisfying the reliability specification.	Generates C source code. Binary code generator implementation is in progress	-	Implementation in progress	Analysis of computational kernels from multimedia and scientific applications.
D-TEC	Simit		Computations on domains expressible as a graph	Internal physics simulations, Lulesh, MiniFE, phdMesh, MiniGhost	Uses C++	Custom IR	Fusion, Blocking, Vectorization, Parallelization, Distribution, Graph Index Sets	LLVM	X86 multicores, GPU and later distributed systems	Design and implementation in progress		Has a visual backend.
TG X-Stack	CnC	https://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc	Medical Imaging, Media software	Lulesh, Rician Denoising, Registration, Segmentation	Graph builder	Graphs, Tags, Item Collections, Step Collections	dependence graph generation for data flow computation	CnC compiler	data flow computation	v0.9	Out-of-the-box speedup for most apps, automatically discovers parallelism	TBB
TG X-Stack	HTA	http://polaris.cs.uiuc.edu/hta/	Scientific applications targeting Matlab	Multigrid, AMR, LU, NAS parallel, SPIKE	HTAlib	Hierarchical tiled arrays	map-reduce operator framework, overlapped tiling, data layering	C++ compilers	Multicore, clusters	0.1	Matched hand-coded MPI	HPC Toolkit
TG X-Stack	HC	https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C	Medical Imaging, Oil and Gas Research	Lulesh, Rician Denoising, Graph 500, UTS, SmithWaterman	EDG	Sage	Continuable task generation to support finish	ROSE	distributed data flow computation, structured parallel computation	v0.5	Performs better than OpenMP for most apps	HPC Toolkit
DEGAS	Asp	http://sejits.org	Infrastructure for building embedded Python DSLs		Python syntax	custom IR based on Python's AST	Key optimizations performed: loop transformations, vectorization, template-based code generation, caching	LLVM (in progress), C, C++, Scala, CUDA, OpenCL	x86, Nvidia GPUs, cloud, MPI	Numerous DSLs including structured grids, recursive communication-avoiding matmult, machine learning algorithms, communication-avoiding solvers	Structured grid DSL achieves 90%+ peak for many kernels on multiple platforms	In progress. Tech report on multi-tiered strategy.
DSL 9
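
The "blocking" entry listed under stencil optimizations in the table above can be illustrated with a minimal sketch. This is plain Python, not Halide code or any project's API; the function names `stencil_naive` and `stencil_blocked` and the `tile` parameter are hypothetical and chosen only for this example. It shows that visiting the grid tile by tile changes the traversal order (and so the memory access pattern) without changing the result.

```python
def _five_point(grid, i, j):
    # Average of a cell and its four direct neighbours.
    return (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
            + grid[i][j - 1] + grid[i][j + 1]) / 5.0


def stencil_naive(grid):
    """Row-by-row 5-point stencil over interior cells; borders are copied."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = _five_point(grid, i, j)
    return out


def stencil_blocked(grid, tile=4):
    """Same stencil, but the interior is visited in tile x tile blocks."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for ti in range(1, rows - 1, tile):          # loop over tile origins
        for tj in range(1, cols - 1, tile):
            # Loop over the cells inside one tile, clipped at the border.
            for i in range(ti, min(ti + tile, rows - 1)):
                for j in range(tj, min(tj + tile, cols - 1)):
                    out[i][j] = _five_point(grid, i, j)
    return out


if __name__ == "__main__":
    demo = [[float((3 * i + 5 * j) % 7) for j in range(10)] for i in range(9)]
    print(stencil_blocked(demo, tile=3) == stencil_naive(demo))
```

In a compiled setting the tile size would be tuned so that one tile's working set fits in cache; the table notes that Halide uses OpenTuner for that kind of schedule search.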

@@ Line 14: / Line 14: @@
 ! style="width: 200;" | Current status
 ! style="width: 200;" | Summary of the best results
+! style="width: 200;" | Interface for perf.&dbg. tools
 |- style="vertical-align:top;"
 |D-TEC
@@ Line 27: / Line 28: @@
 |Working system. Used by Google and Adobe.
 |Local Laplacian filter: a top Adobe engineer took 3 months and 1500 LOC to get 10x over the original; the Halide version took 1 day and 60 lines and ran 20x faster. In addition, 90x faster GPU code was written the same day (Adobe did not even try GPUs). Also, all pictures taken by Google Glass are processed using a Halide pipeline.
+|Interfaces with OpenTuner (http://opentuner.org) to automatically generate schedules. Working on a visualizing/debugging tool.
 |- style="vertical-align:top;"
 |''D-TEC
@@ Line 33: / Line 35: @@
 |MPI HPC applications on many core nodes
 |Internal LLNL App
-|Uses C
+|Uses C (maybe C++ and Fortran in future)
 |ROSE IR
-|Share memory optimization for MPI processes on many core architectures
+|Shared memory optimization for MPI processes on many core architectures permits sharing large data structures between processes to reduce memory requirements per core.
-|ROSE
+|ROSE + any vendor compiler
 |Many core architectures with local shared memory
-|Implementation Released
+|Implementation released (4/28/2014)
 |Being evaluated for use
+|
+|- style="vertical-align:top;"
+|''D-TEC
+| X-GEN for heterogeneous computing
+| http://rosecompiler.org/
+| HPC applications running on NVIDIA GPUs
+| boxlib, internal kernels
+| Uses C and C++
+| ROSE IR (AST)
+| loop collapse to expose more parallelism, Hardware-aware thread/block configuration, data reuse to reduce data transfer, round-robin loop scheduling to reduce memory footprint
+| ROSE source-to-source + NVIDIA CUDA compiler
+| NVIDIA GPUs
+| Implementation released with ROSE (4/29/2014)
+| Matches or outperforms comparable compilers targeting GPUs.
+| Generates event traces for gpuplot to identify serial bottlenecks
+|- style="vertical-align:top;"
+|'' D-TEC
+| NUMA DSL
+| http://rosecompiler.org
+| HPC applications on NUMA-capable many-core CPUs
+| internal LLNL App
+| Uses C++
+| ROSE IR
+| NUMA-aware data distribution to enhance data locality and avoid long memory latency.  Multiple halo exchanging schemes for stencil codes using structured grid.
+| ROSE + libnuma support
+| Many core architecture with NUMA hierarchy
+| implementation in progress.
+| 1.7x performance improvement compared to OpenMP implementation for 2D 2nd order stencil computation.
+| PAPI is used for performance profiling. libnuma and an internal debugging scheme are used to verify memory distribution among NUMA nodes.
+|- style="vertical-align:top;"
+|''D-TEC
+|OpenACC
+|https://github.com/tristanvdb/OpenACC-to-OpenCL-Compiler
+|Accelerated computing
+|Not yet.
+|C (possibly C++ and Fortran). Pragma parser for ROSE.
+| ROSE IR
+| Relies on tiling to map parallel loops to OpenCL
+| ROSE (with OpenCL kernel generation backend), OpenCL C Compiler (LLVM)
+|Any accelerator with OpenCL support (CPUs, GPUs, XeonPhi, ...)
+| - Basic kernel generation - Directives parsing - Runtime tested on NVIDIA GPUs, Intel CPUs, and Intel XeonPhi
+| Reaches ~50 Gflops on Tesla M2070 on matrix multiply. (M2070: ~1 Tflops peak, ~200 to ~400 Gflops effective on linear algebra; all floating point).
+| A profiling interface collects OpenCL profiling information in a database.
 |- style="vertical-align:top;"
-|''URL
+|''D-TEC
-|
+| Rely
-|
+| http://groups.csail.mit.edu/pac/rely/
-|
+| Reliability-aware computing and Approximate computing
-|
+| Internal kernels
-|
+| Subset of C with additional reliability annotations
-|
+| Custom IR
-|
+|A language and a static analysis framework for verifying reliability of programs given function-level reliability specifications. Chisel, a code transformation tool built on top of Rely, automatically selects operations that can execute unreliably with minimum resource consumption, while satisfying the reliability specification.
-|
+| Generates C source code. Binary code generator implementation is in progress
-|
+| -
+| Implementation in progress
+| Analysis of computational kernels from multimedia and scientific applications.
 |
 |
 |- style="vertical-align:top;"
-|''Target domain''
+|''D-TEC
-|
+| Simit
 |
-|
+| Computations on domains expressible as a graph
-|
+| Internal physics simulations, Lulesh, MiniFE, phdMesh, MiniGhost
-|
+| Uses C++
-|
+| Custom IR
-|
+| Fusion, Blocking, Vectorization, Parallelization, Distribution, Graph Index Sets
-|
+| LLVM
-|
+| X86 multicores, GPU and later distributed systems
-|
+| Design and implementation in progress
 |
+| Has a visual backend.
+|- style="vertical-align:top;"
+|''TG X-Stack''
+|CnC
+|https://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc
+|Medical Imaging, Media software
+|Lulesh, Rician Denoising, Registration, Segmentation
+|Graph builder
+|Graphs, Tags, Item Collections, Step Collections
+|dependence graph generation for data flow computation
+|CnC compiler
+|data flow computation
+|v0.9
+|Out-of-the-box speedup for most apps, automatically discovers parallelism
+|TBB
+|- style="vertical-align:top;"
+|''TG X-Stack''
+|HTA
+|http://polaris.cs.uiuc.edu/hta/
+|Scientific applications targeting Matlab
+|Multigrid, AMR, LU, NAS parallel, SPIKE
+|HTAlib
+|Hierarchical tiled arrays
+|map-reduce operator framework, overlapped tiling, data layering
+|C++ compilers
+|Multicore, clusters
+|0.1
+|Matched hand-coded MPI
+|HPC Toolkit
 |- style="vertical-align:top;"
-|''Miniapps supported
+|''TG X-Stack''
-|
+|HC
-|
+|https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C
-|
+|Medical Imaging, Oil and Gas Research
-|
+|Lulesh, Rician Denoising, Graph 500, UTS, SmithWaterman
-|
+|EDG
-|
+|Sage
-|
+|Continuable task generation to support finish
-|
+|ROSE
-|
+|distributed data flow computation, structured parallel computation
-|
+|v0.5
-|
+|Performs better than OpenMP for most apps
+|HPC Toolkit
 |- style="vertical-align:top;"
-|''Xstack projects involved
+|DEGAS
-|
+|Asp
-|
+|http://sejits.org
-|
+|Infrastructure for building embedded Python DSLs
-|
-|
-|
-|
-|
-|
-|
 |
+|Python syntax
+|custom IR based on Python's AST
+|Key optimizations performed: loop transformations, vectorization, template-based code generation, caching
+|LLVM (in progress), C, C++, Scala, CUDA, OpenCL
+|x86, Nvidia GPUs, cloud, MPI
+|Numerous DSLs including structured grids, recursive communication-avoiding matmult, machine learning algorithms, communication-avoiding solvers
+|Structured grid DSL achieves 90%+ peak for many kernels on multiple platforms
+|In progress. Tech report on multi-tiered strategy.
 |- style="vertical-align:top;"
-|''Internal representation used
+|''DSL 9
-|
-|
-|
-|
-|
-|
-|
 |
-|
-|
-|
-|- style="vertical-align:top;"
-|''Key Optimizations performed
 |
 |
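
The X-GEN row lists "loop collapse to expose more parallelism" among its optimizations. The sketch below is one way to picture that idea, assuming a two-level loop nest with independent iterations; it is not X-GEN or ROSE code, and the function name `collapsed_chunks` is hypothetical. With 2 rows and 6 columns, parallelizing only the outer loop offers 2 units of work, while the collapsed space offers 12 iterations that divide evenly among 4 workers.

```python
def collapsed_chunks(rows, cols, workers):
    """Split the collapsed iteration space [0, rows*cols) into contiguous,
    near-equal chunks, one per worker, and map each flat index k back to
    its (i, j) pair."""
    total = rows * cols
    base, extra = divmod(total, workers)
    chunks = []
    start = 0
    for w in range(workers):
        # The first `extra` workers take one additional iteration.
        size = base + (1 if w < extra else 0)
        chunks.append([(k // cols, k % cols)
                       for k in range(start, start + size)])
        start += size
    return chunks


if __name__ == "__main__":
    for worker, chunk in enumerate(collapsed_chunks(2, 6, 4)):
        print(worker, chunk)
```

On a GPU the same index arithmetic would typically be done per thread from a flat thread ID rather than by building lists; the lists here only make the mapping visible.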

From Modelado Foundation
