Compilers

Sonia requested that Dan Quinlan initiate this page. For comments, please contact Dan Quinlan. This page is still in development.

Project	PI
XPRESS	Ron Brightwell
TG	Shekhar Borkar
DEGAS	Katherine Yelick
D-TEC	Daniel Quinlan
DynAX	Guang Gao
X-TUNE	Mary Hall
GVR	Andrew Chien
CORVETTE	Koushik Sen
SLEEC	Milind Kulkarni
PIPER	Martin Schulz

Questions:

Describe how you expect to require compiler support within your X-Stack project.
Program analysis can be challenging and can require specialized expertise. What requirements do you have for program analysis, and what level of expertise do you expect to require? This problem could also be posed as: what APIs for program analysis results do you expect?
What types of hardware do you expect to address/target within optimizations and at what level of granularity of the program (e.g. coarse-grain, over multiple functions, or fine-grain within statements)?
What general-purpose languages do you expect to use and/or extend to support your research work?
Do you expect to use, require, or develop an Embedded DSL (defined by compiler support that would leverage semantics of abstractions defined completely within a general-purpose base language) or an Extended DSL (defined by compiler support that would leverage semantics of abstractions defined by new syntax)?
What generic and customized code transformations do you require to support your project?
Which level of Intermediate Representation do you prefer to work with: source level, normalized middle level, or low level (close to binary code)?
Which parallel programming models (MPI, OpenMP, UPC, etc.) do you want better compiler support for?
On what OS configurations and hardware platforms do you want to run the compiler?
How do you expect compilers to incorporate domain-specific information, through DSL, separated semantics-specification files, or other methods?
How do you expect compilers to interact with your libraries or runtime systems, if any?

Describe how you expect to require compiler support within your X-Stack project.
XPRESS	Currently, the HPX/XPI approach is based on libraries developed with existing C or C++ compilers, several of which are available. A future project, PXC, will develop an advanced compiler capability to fully represent the ParalleX execution model, but this is outside the scope of the X-Stack program.
TG	The compiler will be used at various levels in our project, from most basic to most complex: (a) support for the different ISAs of our accelerator units (XEs), in particular intrinsic support for all the specialized instructions available (DMA, QMA, advanced math, ...); (b) keyword extensions to make our programming model concepts (data-blocks, EDTs, etc.) first-class citizens for the compiler (we currently have limited keywords hinting at memory placement); (c) advanced code generation using keywords, which would enable the compiler to transform straight C (with keywords) into OCR code; (d) code refactoring that would allow the compiler to take very fine-grained tasks/DBs and emit code that better balances runtime overhead against the amount of parallelism expressed, to better suit the target machine (this part is still TBD).
DEGAS
D-TEC	Compiler support plays an essential role in the D-TEC project. We use compilers to accept programs written in DSLs and to analyze, transform, and optimize them. A backend compiler is also needed to generate the final executable on a target platform. To support flexible definition of DSLs, a compiler infrastructure is needed that facilitates adding extensions to base languages or defining entirely new languages. A DSL program may be transformed from a high-level form to a low-level form. The compiler should be able to maintain the mapping between the different levels so that we can relate high-level semantics to low-level performance metrics.
DynAX	Compilers serve two purposes in the DynAX project, both aimed at increasing productivity and programmability. (1) The first role is to automatically parallelize domain-specific applications from a sequential specification. This is what R-Stream does: it takes sequential C and produces parallel, scalable SWARM code. (2) The second role is to expose high-level parallel programming abstractions. This goal is addressed at two levels: the HTA compiler, which relies on an explicitly parallel intermediate representation (PIL), generates SCALE code; the SCALE compiler, in turn, offers object-oriented programming and simplifies programming for SWARM.
X-TUNE	We are developing source-to-source compiler transformations in our project. ROSE provides the abstract syntax tree for our compiler and our modeling software (CHiLL and PBound, respectively). We rely on native backend compilers to perform architecture-specific optimizations and generate SIMD code (SSE, AVX, etc.). Our transformations generate code that we anticipate will be easily vectorized.
GVR
CORVETTE
SLEEC	Compilers are used for two purposes in SLEEC: 1) to translate annotations on libraries into directives understood by SLEEC runtime systems (e.g., translating directives regarding inputs/outputs of kernels into SemCache API calls); 2) to perform high-level optimizations of programs written using SLEEC-enabled libraries (e.g., performing linear algebra optimizations on applications written with BLAS).
PIPER	PIPER does not have or need its own compiler, but we need access to compiler-generated information that captures the high-level semantics of the language implemented (basically a form of DWARF for DSLs). Additionally, advanced instrumentation or instrumentation points/markers could be useful. Information provided by different DSLs should be interoperable, ideally standardized, and compatible with the debug information provided by existing compilers.

Program analysis can be challenging and can require specialized expertise. What requirements do you have for program analysis, and what level of expertise do you expect to require? This problem could also be posed as: what APIs for program analysis results do you expect?
XPRESS	The XPRESS project is exploring the strengths and opportunities enabled by runtime control for dynamic adaptive resource management and task scheduling. Through this investigation, some compile-time information will be identified as important to convey to the compiler. Some compiler dataflow analysis of fine-grained parallelism is expected to be required, but such analysis is standard today and does not require new capabilities.
TG	Inter-procedural optimizations.
DEGAS
D-TEC	We would like to have access to a range of baseline compiler analyses through simple API functions. Typical examples are control flow analysis, data flow analysis, and dependence analysis. In addition, we need extensible versions of these baseline analyses so they can be applied to new DSLs. We also need domain-specific analyses that can take advantage of the domain knowledge in each DSL. Ideally, program analysis support should be able to leverage user input, through annotations or semantics-specification files.
DynAX	The PIL and SCALE compilers support parallel languages as their input (PIL supports Hierarchically Tiled Arrays for data-parallel programs, and SCALE accepts structured, object-oriented codelet programs). The R-Stream compiler supports sequential C loops to which a set of writing rules (a "style") are applied by the programmer. The rules, which entail exposing enough static information to the compiler, are defined in the R-Stream user guide. Some pragmas are defined that allow the user to provide additional hints to the compiler. R-Stream relies on extensions of the polyhedral model to represent, analyze and transform programs.
X-TUNE	At present, we have implemented the program analysis we need for our optimizations.
GVR
CORVETTE
SLEEC	We expect to use analysis results from frameworks like Fuse to build our IR for compiler transformations. We will be working on extending Fuse to perform analysis over "locations" that are matrices, or disjoint sets of memory locations.
PIPER	PIPER mainly focuses on runtime performance analysis. This could benefit greatly from access to static information (loop structures, static call trees, data structures, ...), which should be provided by compilers in a standardized manner (through APIs or debug information encoded into binaries). HPCToolkit - one of PIPER's tools - parses machine code, reconstructs control flow graphs, performs interval analysis to identify loops, and then combines the information about loops with information about inlining from DWARF to attribute performance metrics to optimized code.
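PIPER's answer describes HPCToolkit recovering loops from a reconstructed control flow graph. The C sketch below illustrates the general idea with a textbook stand-in, not HPCToolkit's actual interval analysis. It computes dominators iteratively and treats an edge u -> h as a back edge when h dominates u. It then collects the natural loop of each back edge. The `Cfg` type, the 32-node bitset limit, and all function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 32 /* node sets are stored as bits of a uint32_t */

/* Hypothetical CFG representation: succ[u] is a bitmask of u's successors.
   Node 0 is the entry; all nodes are assumed reachable from it. */
typedef struct {
    int n;
    uint32_t succ[MAX_NODES];
} Cfg;

/* Iterative dataflow: dom[v] = {v} | intersection of dom[p] over predecessors p. */
void compute_dominators(const Cfg *g, uint32_t dom[]) {
    uint32_t all = (g->n == MAX_NODES) ? 0xFFFFFFFFu : ((1u << g->n) - 1u);
    dom[0] = 1u; /* the entry is dominated only by itself */
    for (int v = 1; v < g->n; v++) dom[v] = all;
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 1; v < g->n; v++) {
            uint32_t meet = all;
            for (int p = 0; p < g->n; p++)
                if (g->succ[p] & (1u << v)) meet &= dom[p];
            uint32_t next = meet | (1u << v);
            if (next != dom[v]) {
                dom[v] = next;
                changed = 1;
            }
        }
    }
}

/* Natural loop of back edge tail -> head: head plus every node that can
   reach tail without passing through head (backward search from tail). */
uint32_t natural_loop(const Cfg *g, int tail, int head) {
    uint32_t body = (1u << head) | (1u << tail);
    int stack[MAX_NODES];
    int top = 0;
    if (tail != head) stack[top++] = tail;
    while (top > 0) {
        int v = stack[--top];
        for (int p = 0; p < g->n; p++) {
            if ((g->succ[p] & (1u << v)) && !(body & (1u << p))) {
                body |= 1u << p; /* mark before pushing: each node is pushed at most once */
                stack[top++] = p;
            }
        }
    }
    return body;
}

/* Reports one loop per back edge; heads/bodies need room for up to one
   entry per CFG edge. Returns the number of loops found. */
int find_loops(const Cfg *g, int heads[], uint32_t bodies[]) {
    assert(g->n > 0 && g->n <= MAX_NODES);
    uint32_t dom[MAX_NODES];
    compute_dominators(g, dom);
    int count = 0;
    for (int u = 0; u < g->n; u++)
        for (int h = 0; h < g->n; h++)
            if ((g->succ[u] & (1u << h)) && (dom[u] & (1u << h))) {
                heads[count] = h;
                bodies[count] = natural_loop(g, u, h);
                count++;
            }
    return count;
}
```

For a CFG with edges 0->1, 1->2, 2->3, 3->2, 3->4, 4->1 and 1->5, `find_loops` reports an inner loop {2, 3} headed by node 2 and an outer loop {1, 2, 3, 4} headed by node 1. Attributing metrics to these loops, and to inlined code via DWARF, is the part a real tool layers on top.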

What types of hardware do you expect to address/target within optimizations and at what level of granularity of the program (e.g. coarse-grain, over multiple functions, or fine-grain within statements)?
XPRESS	The ParalleX-based approach exposes very coarse-grain, medium-grain (e.g., threads), and fine-grain dataflow (e.g., instruction-level) parallelism in support of hardware with heterogeneous functional units and memory hierarchies. It is also intended to inform future hardware designs with more efficient mechanisms supporting system-wide communication, global addressing, and dynamic execution.
TG	While our programming model is applicable to today's machines, we are specifically considering machines that will have a global address space with strong NUMA characteristics. For a first pass, we expect the compiler to optimize EDTs (i.e., small chunks of code, potentially spanning multiple functions). The granularity of the EDTs would initially be up to the user, but at a later stage the compiler may merge or split tasks to better match task granularity to the target machine.
DEGAS
D-TEC	We expect future extreme-scale computers to consist of heterogeneous node architectures connected by a network. An example heterogeneous node architecture is a multicore shared-memory machine with an NVIDIA GPU accelerator that has a separate memory space. For shared memory, non-uniform memory access (NUMA) is expected to be necessary for scalability. We want to exploit the coarse-grain parallelism of a program first, then incrementally take advantage of finer-grain parallelism and map it to the appropriate levels of hardware.
DynAX	We currently target clusters of x86 multicore nodes for our experiments, but we are considering other targets as well, such as Intel's straw man architecture. So far we have worked at function-level granularity, but optimizing across multiple functions is possible to some degree.
X-TUNE	We are currently focused on two classes of processor: (1) cache coherent multi- and manycore CPU architectures including the Xeon Phi / MIC architecture; and, (2) NVIDIA GPU architectures which are coherent only within a thread block. The types of optimizations we perform include fusion across operators, wavefront parallelism, introducing ghost zones, and fine-grain rewriting of computations to improve SIMD and instruction-level parallelism. The fusion across operators could potentially be applied across functions.
GVR
CORVETTE
SLEEC	In addition to targeting general systems with our compiler transformations, we specifically target heterogeneous hardware (e.g., CPU/GPU nodes) with SemCache, and perform optimizations across method (library) calls.
PIPER	All of the above.
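X-TUNE lists fusion across operators among its optimizations. Here is a minimal illustration, assuming two hypothetical operators: a 1-D smoothing stencil and a pointwise residual. Fusing them removes the temporary array and the second sweep over memory. This is a generic hand-written sketch of the transformation, not CHiLL output.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical operator 1: 1-D smoothing stencil on interior points. */
void smooth(int n, const double *u, double *t) {
    for (int i = 1; i < n - 1; i++)
        t[i] = (u[i - 1] + 2.0 * u[i] + u[i + 1]) / 4.0;
}

/* Hypothetical operator 2: pointwise residual r = f - t on interior points. */
void residual(int n, const double *f, const double *t, double *r) {
    for (int i = 1; i < n - 1; i++)
        r[i] = f[i] - t[i];
}

/* Unfused: two sweeps over memory and a full-size temporary array. */
void unfused(int n, const double *u, const double *f, double *r) {
    assert(n >= 3); /* the stencil needs at least one interior point */
    double *t = malloc((size_t)n * sizeof *t);
    if (t == NULL) return; /* a real implementation would report the error */
    smooth(n, u, t);
    residual(n, f, t, r);
    free(t);
}

/* Fused: each smoothed value is consumed as soon as it is produced, so the
   temporary shrinks to a scalar and u and f are traversed once. This is legal
   because r[i] reads only t[i], which is computed in the same iteration. */
void fused(int n, const double *u, const double *f, double *r) {
    assert(n >= 3);
    for (int i = 1; i < n - 1; i++) {
        double t = (u[i - 1] + 2.0 * u[i] + u[i + 1]) / 4.0;
        r[i] = f[i] - t;
    }
}
```

Take u = {0, 1, 2, 3, 4, 5} with f set to all ones. Both versions give r[i] = 1 - i on the interior, because the stencil reproduces a linear u exactly. When a consumer reads producer values at neighboring indices, fusion additionally needs shifting or redundant computation, which is where techniques such as ghost zones come in.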

What general-purpose languages do you expect to use and/or extend to support your research work?
XPRESS	XPRESS contends that, other than for the purpose of supporting legacy codes, there is no correct language for the future of exascale, despite the ardent claims of many supporters of Fortran, C++, OpenMP, MPI, Chapel, X10, and a long list of others. While it is true that the ultimate language is LISP, only a few truly enlightened individuals are qualified to recognize this.
TG	C
DEGAS
D-TEC
DynAX	Both R-Stream and SWARM use C as their input language.
X-TUNE	We currently support C, C++ and Fortran because these are supported by the ROSE frontend.
GVR
CORVETTE
SLEEC	C/C++
PIPER	PIPER components will be written in C/C++ plus scripting languages (mostly Python). All components/tools will be applicable to a wide range of source languages or even binaries. The exact list of supported languages is still to be determined and will depend on demand and on progress on the overall exascale software stack.

Do you expect to use, require, or develop an Embedded DSL (defined by compiler support that would leverage semantics of abstractions defined completely within a general-purpose base language) or an Extended DSL (defined by compiler support that would leverage semantics of abstractions defined by new syntax)?
XPRESS	DSLs hold the promise of relieving the burden of coding for key applications or functionality classes, and XPRESS expects to support them. However, for such DSLs to exhibit exascale capability, their back ends will have to target the HPX runtime system, either directly or through offered interfaces such as XPI or PXC.
TG	Apart from limited keywords to better support the OCR programming model, we would support other languages (including DSLs) through the use of higher level source-to-source translators.
DEGAS
D-TEC	We want to support all flavours of DSLs.
DynAX	The R-Stream approach to domain-specificity is to define a programming style and enable domain-specific annotations. The advantage of this approach is that the user still programs in C, and doesn't need to learn a new syntax. We have developed SCALE as a general-purpose programming language within the domain of HPC. We have not conceived of any of the current SCALE features as being specific to a narrower domain than that.
X-TUNE	We are developing domain-specific optimizations for geometric multigrid and stencil computations. These optimizations could be incorporated into a DSL for such applications, but we are applying the optimizations directly to C and Fortran code. We are also developing a tool for expressing and optimizing tensor computations that could be considered a DSL.
GVR
CORVETTE
SLEEC	N/A
PIPER	Possibly as an interface to query and analyze performance data, but this is unclear as of now.

What generic and customized code transformations do you require to support your project?
XPRESS
TG	Nothing specific is required, but we will add keywords to C to support OCR's programming model.
DEGAS
D-TEC	We are interested in a wide range of generic transformations, such as loop transformations, instrumentation, GPU code generation, and data structure transformation. We also want to have a compiler infrastructure which provides easy code transformation APIs so we can add customized, domain-specific transformations.
DynAX	R-Stream supports a wide range of loop and data layout transformations, some of which are specific to stencil operations. These transformations mainly create data locality and parallelism at various levels of the target architecture.
X-TUNE	We are developing the transformations ourselves.
GVR
CORVETTE
SLEEC	We leverage general code motion transformations to expose opportunities for high-level optimization.
PIPER	Instrumentation
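Several answers above mention loop transformations that improve data locality. Loop tiling (blocking) is a common example, and the sketch below applies it to a matrix transpose. It is a hand-written illustration with hypothetical function names and an arbitrary tile size. It is not the output of R-Stream or any other tool named on this page.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose of an n x n row-major matrix: consecutive writes to b
   are n elements apart, so cache lines of b are poorly reused. */
void transpose_naive(size_t n, const double *a, double *b) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

/* Tiled transpose: the (i, j) iteration space is split into tile x tile
   blocks so the parts of a and b touched by one block can stay in cache.
   The end bounds are clamped so n need not be a multiple of tile. */
void transpose_tiled(size_t n, size_t tile, const double *a, double *b) {
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile) {
            size_t i_end = ii + tile < n ? ii + tile : n;
            size_t j_end = jj + tile < n ? jj + tile : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    b[j * n + i] = a[i * n + j];
        }
}
```

Both functions compute the same result for any positive tile size. Only the iteration order, and therefore the memory access pattern, changes. The best tile size depends on the cache hierarchy of the target machine.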

Which level of Intermediate Representation do you prefer to work with: source level, normalized middle level, or low level (close to binary code)?
XPRESS	Currently, the selected intermediate representation is the source-code XPI interface, although lower-level HPX library calls through C or C++ are also enabled.
TG	LLVM has an IR that it uses all the way through. It is very flexible and we expect to work at that level.
DEGAS
D-TEC	Within D-TEC, we want to support all levels of intermediate representations to fully support the analysis, optimization and code generation of DSLs. For example, a high level representation is best to preserve high level code structures. A normalized middle-level representation is most important for many static analyses. A low-level IR is very suitable for machine-specific optimizations.
DynAX	We do not have a preference, but in our experience, users want to program at the highest possible level, while still being able to access low-level code and understand the parallelization process. Our source-to-source tools enable this.
X-TUNE	As we are applying source-to-source transformations, we prefer an intermediate representation that is close to the source level.
GVR
CORVETTE
SLEEC	Source level (ish). Our IR captures more information (dependences, semantic type information, etc.) than is available at source level, but we want to be able to translate back to source.
PIPER	Mostly does not apply to PIPER, but some elements of autotuning could use code transformation, e.g., to create multiple variants of the same code.
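PIPER mentions autotuning over multiple variants of the same code. Below is a minimal sketch of empirical variant selection. It assumes two hand-written variants of an array sum (plain, and 4-way unrolled with separate accumulators) and times them with the standard C `clock()` function. Each variant is run repeatedly, and the index of the fastest one is returned. The variant names and the selection policy are hypothetical. A real autotuner would generate the variants by code transformation and handle timing noise more carefully.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical variant 1: straightforward loop. */
long long sum_v1(const int *x, size_t n) {
    long long s = 0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Hypothetical variant 2: 4-way unrolled with independent accumulators. */
long long sum_v4(const int *x, size_t n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i]; /* remainder iterations */
    return s0 + s1 + s2 + s3;
}

typedef long long (*sum_variant)(const int *, size_t);

/* Runs each variant `reps` times on the same input and returns the index
   of the one with the smallest measured CPU time. */
size_t pick_fastest(const sum_variant *variants, size_t count,
                    const int *x, size_t n, int reps) {
    assert(count > 0 && reps > 0);
    size_t best = 0;
    double best_time = -1.0;
    volatile long long sink = 0; /* keeps the calls from being optimized away */
    for (size_t v = 0; v < count; v++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++) sink += variants[v](x, n);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = v;
        }
    }
    (void)sink;
    return best;
}
```

Integer sums are used so that both variants return identical results regardless of summation order. With floating-point data, reordering the accumulation can change the result, which an autotuner would have to accept or check.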

Which parallel programming models (MPI, OpenMP, UPC, etc.) do you want better compiler support for?
XPRESS	XPRESS supports MPI and OpenMP.
TG	The OCR programming model (fine grained event driven tasks).
DEGAS
D-TEC	We are interested in MPI and OpenMP. Ideally, the compiler should be aware of MPI function calls and provide an OpenMP implementation.
DynAX	None, although I'm not sure I fully understand the question (as of 5/5/14).
X-TUNE	N/A
GVR
CORVETTE
SLEEC	N/A
PIPER	N/A

On what OS configurations and hardware platforms do you want to run the compiler?
XPRESS	Initially, it is anticipated that a cross-compiler running on a standard Linux platform and targeting the HPX environment will be used.
TG	Custom.
DEGAS
D-TEC	Linux is our main focus OS. Target platforms include Intel/AMD x86 multicore machines with NVIDIA GPUs.
DynAX	PIL, R-Stream, SCALE and SWARM support Linux (complete list of tested distributions available). Successful Mac OS uses of R-Stream were also reported (using Darwin), although it is not officially supported. R-Stream can also cross-compile to any platform as long as the native low-level compiler supports this feature. For SCALE and SWARM, cross-compilation is done for Xeon Phi.
X-TUNE	We are currently supporting Linux-based systems, and also Nvidia GPUs.
GVR
CORVETTE
SLEEC	N/A
PIPER	N/A

How do you expect compilers to incorporate domain-specific information, through DSL, separated semantics-specification files, or other methods?
XPRESS
TG	We expect domain specific knowledge to be expressed through keywords and also specific API calls (that the compiler could identify if needed).
DEGAS
D-TEC	We want to support DSL, annotations, and separated semantics-specification files.
DynAX	As explained above, in R-Stream a domain-specific formulation of the program is obtained by combining style rules and annotations (#pragma). Semantics and syntax remain as well defined as those of the underlying C language. PIL is a framework to facilitate any-to-any compilation of parallel languages; it does not use domain-specific information, but it works for any language.
X-TUNE	Scalability and determining what is an error seem like the biggest challenges.
GVR
CORVETTE
SLEEC	We expect domain specific knowledge to be encoded as annotations on library methods (potentially in a separate specification file).
PIPER	N/A

How do you expect compilers to interact with your libraries or runtime systems, if any?
XPRESS
TG	The compiler should ideally refactor the code to better match the hardware and restrict the number of choices the runtime has to make, but only in a way that is non-constraining (i.e., in cases where any other choice would be detrimental in the vast majority of cases). The compiler should also potentially present multiple alternatives to the runtime to allow directed choices. In short, the compiler should not make any decision it is not sure about, but should try to reduce the size of the decision space for the runtime (to reduce overhead).
DEGAS
D-TEC	The compiler will generate tasks (kernels) to be executed at runtime. It will also transform and partition data in programs so a NUMA-aware runtime library can be used.
DynAX	R-Stream generates code that includes calls to the target machine's runtimes/libraries for the purpose of parallelization and locality optimization. R-Stream also supports library calls within the input code.
X-TUNE	Our compiler generates calls to run-time libraries to create threads. Currently, we support OpenMP and CUDA.
GVR
CORVETTE
SLEEC	N/A
PIPER	Yes, by providing additional debug information and code-to-abstraction mapping information.
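X-TUNE notes that its compiler generates calls to runtime libraries to create threads, targeting OpenMP and CUDA. For the OpenMP case, a source-to-source tool can emit a directive such as `#pragma omp parallel for`. An OpenMP-enabled C compiler then lowers that directive to calls into its OpenMP runtime. The sketch below shows such a before/after pair for a hypothetical `axpy` loop. It is illustrative only, not X-TUNE's actual output. Compiled without OpenMP support, the pragma is ignored and the loop runs sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Input loop as a user might write it (hypothetical example). */
void axpy_seq(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same loop after a hypothetical source-to-source pass has shown that the
   iterations are independent. With OpenMP enabled (e.g. -fopenmp in GCC), the
   compiler lowers this pragma to calls into its OpenMP runtime, which creates
   the threads and divides the iterations statically among them. */
void axpy_omp(size_t n, double a, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Checks that two result arrays match element by element. */
void check_same(size_t n, const double *p, const double *q) {
    for (size_t i = 0; i < n; i++)
        assert(p[i] == q[i]);
}
```

Because each iteration writes a distinct `y[i]`, the parallel version computes exactly the same values as the sequential one. The compiler's job in this scheme is the independence proof, and the runtime handles thread creation and scheduling.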

From Modelado Foundation