D-TEC
| D-TEC | |
|---|---|
| Team Members | LLNL, MIT, Rice U., IBM, OSU, UC Berkeley, U. of Oregon, LBNL, UC San Diego |
| PI | Daniel J. Quinlan (LLNL) |
| Co-PIs | Saman Amarasinghe (MIT), Armando Solar‐Lezama (MIT), Adam Chlipala (MIT), Srinivas Devadas (MIT), Una‐May O’Reilly (MIT), Nir Shavit (MIT), Youssef Marzouk (MIT), John Mellor‐Crummey (Rice U.), Vivek Sarkar (Rice U.), Vijay Saraswat (IBM), David Grove (IBM), P. Sadayappan (OSU), Atanas Rountev (OSU), Ras Bodik (UC Berkeley), Craig Rasmussen (U. of Oregon), Phil Colella (LBNL), Scott Baden (UC San Diego) |
| Website | team website |
DSL Technology for Exascale Computing or D-TEC
Domain Specific Languages (DSLs) are a transformational technology that captures expert knowledge about application domains. For the domain scientist, the DSL provides a view of the high‐level programming model. The DSL compiler captures expert knowledge about how to map high‐level abstractions to different architectures. The DSL compiler’s analysis and transformations are complemented by the general compiler analysis and transformations shared by general purpose languages.
- There are different types of DSLs:
- Embedded DSLs: Have custom compiler support for high level abstractions defined in a host language (abstractions defined via a library, for example)
- General DSLs (syntax extended): Have their own syntax and grammar; can be full languages, but defined to address a narrowly defined domain
- DSL design is a responsibility shared between application domain and algorithm scientists
- Extraction of abstractions requires significant application and algorithm expertise
- We have an application team to:
- provide expertise that will ground our DSL research
- ensure its relevance to DOE and enable impact by the end of three years
Team Members
- Lawrence Livermore National Laboratory (LLNL)
- Massachusetts Institute of Technology (MIT)
- Rice University
- IBM
- Ohio State University (OSU)
- University of California, Berkeley
- University of Oregon
- Lawrence Berkeley National Laboratory (LBNL)
- University of California, San Diego
Goals and Objectives
D‐TEC Goal: Making DSLs Effective for Exascale
- We address all parts of the Exascale Stack:
- Languages (DSLs): define and build several DSLs economically
- Compilers: define and demonstrate the analysis and optimizations required to build DSLs
- Parameterized Abstract Machine: define how the hardware is evaluated to provide inputs to the compiler and runtime
- Runtime System: define a runtime system and resource management support for DSLs
- Tools: design and use tools to communicate to specific levels of abstraction in the DSLs
- We will provide effective performance by addressing exascale challenges:
- Scalability: deeply integrated with state‐of‐art X10 scaling framework
- Programmability: build DSLs around high levels of abstraction for specific domains
- Performance Portability: DSL compilers give greater flexibility to the code generation for diverse architectures
- Resilience: define compiler and runtime technology to make code resilient
- Energy Efficiency: machine learning and autotuning will drive energy efficiency
- Correctness: formal methods technologies required to verify DSL transformations
- Heterogeneity: demonstrate how to automatically generate lower level multi‐ISA code
- Our approach includes interoperability and a migration strategy:
- Interoperability with MPI + X: demonstrate embedding of DSLs into MPI + X applications
- Migration for Existing Code: demonstrate source‐to-source technology to migrate existing code
The D‐TEC approach addresses the full Exascale workflow
- Discovery of domain specific abstractions from proxy‐apps by application and algorithm experts
- (C1 & C2) Defining Domain Specific Languages (DSLs)
- The role of the DSL is to encapsulate expert knowledge
- About the problem domain
- The DSL compiler encapsulates how to optimize code for that domain on new architectures
- Rosebud used to define DSLs (a novel framework for joint optimization of mixed DSLs)
- DSL specification is used to generate a "DSL plug‐in" for Rosebud's DSL compiler
- Supports both embedded and general DSLs and multiple DSLs in one host‐language source file
- DSL optimization is done via cost‐based search over the space of possible rewritings
- Costs are domain‐specific, based on shared abstract machine model + ROSE analysis results
- Cross‐DSL optimization occurs naturally via search of combined rewriting space
- Sketching used to define DSLs (cutting‐edge aspect of our proposal)
- Series of manual refinement steps (code rewrites) define the transformations
- Equivalence checking between steps to verify correctness
- The series of transformations define the DSL compiler using ROSE
- Machine learning is used to drive optimizations
- Both approaches will leverage the common ROSE infrastructure
- Both approaches will leverage the SEEC enhanced X10 runtime system
- (C3) DSL Compiler
- Leverages ROSE compiler throughout
- (C4) Parameterized Abstract Machine
- Extraction of machine characteristics
- (C5) Runtime System
- Leverages X10 and extends it with SEEC support
- (C6) Tools
- We will define source-to‐source migration tools
- We will define the mappings between DSL layers to support future tools
Software Stack
Rosebud
Rosebud Overview
- Unified framework for DSL implementation
- all aspects: parsing, analysis, optimization, code generation
- all types: embedded, custom‐syntax, standalone
- Modular development and use of DSLs
- textual DSL description => plug‐in to ROSE DSL Compiler
- plug‐ins developed separately from ROSE and each other
- Knowledge‐based optimization of DSL programs
- plug‐in encapsulates expert optimization knowledge
- ROSE supplies conventional compiler optimizations
- Flexible code generation
- DSL lowered to any ROSE host language
- DSL compiled directly to (portable) machine code via LLVM
Rosebud Implementation
- DSL front end
- SGLR parser + predefined host‐language grammars
- attribute grammar + ROSE extensible AST and analysis
- DSL optimizer
- declarative rewriting system + procedural hooks to ROSE
- cost‐based heuristic search of implementation space
- domain‐specific costs based on abstract machine model
- cross‐DSL optimization arises naturally from joint search space
- DSL code generator
- ROSE host language unparsers
- ROSE AST => LLVM SSA code
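As a rough picture of the "cost‐based heuristic search of implementation space", the sketch below applies rewrite rules greedily and keeps the lowest‐cost version. All names (Term, Rewrite, the cost field) are hypothetical stand‐ins; the actual Rosebud optimizer operates on ROSE ASTs with domain‐specific costs from the abstract machine model.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins: a Term is a program fragment, a Rewrite turns one
// Term into another when it applies, and cost comes from an abstract machine
// model. This is an illustration of the search shape, not the real optimizer.
struct Term {
    std::string expr;   // simplified stand-in for a ROSE AST fragment
    double cost;        // estimated cost under the abstract machine model
};
using Rewrite = std::function<std::optional<Term>(const Term&)>;

// Greedy hill-climbing over the rewriting space: repeatedly apply the single
// rewrite that most reduces estimated cost, and stop at a local optimum.
Term optimize(Term program, const std::vector<Rewrite>& rules) {
    bool improved = true;
    while (improved) {
        improved = false;
        Term best = program;
        for (const Rewrite& r : rules) {
            if (std::optional<Term> candidate = r(program)) {
                if (candidate->cost < best.cost) { best = *candidate; improved = true; }
            }
        }
        program = best;
    }
    return program;
}
```

A beam or exhaustive search over the same space (including the cross‐DSL joint search mentioned above) would keep a work list of candidate terms instead of a single current term, but follows the same apply/estimate/keep structure.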
Rosebud Plug-ins
- Plug‐ins developed separately from ROSE and each other
- Plug‐ins distributed in source or object form
- Selected plug‐ins supplied to Rosebud DSL Compiler to compile mixed DSLs in a host language source file
Rosebud DSL Compiler
Two‐phase parsing for DSL language support
- Host language + multiple DSLs in the same source file
- expressive custom notations
- familiar general-purpose language
- Phase 1: extract and parse DSLs
- via Stratego SDF parsing system
- Phase 2: parse host language
- via existing ROSE front ends
- Merge DSL tree fragments into host language AST
- DSL plug-ins provide custom tree nodes and semantic analysis
LOPe Programming Model
LOPe programming model is easily expressed in Fortran because of syntax for arrays
- Halo attribute added to arrays
- HALO(1:*:1, 1:*:1)
- specifies one border cell on each side of a two‐dimensional array
- \* denotes the interior (non‐halo) extent of the array in that dimension
- Halos are logical cells not necessarily physically part of the array
- Halos can be communicated with coarrays
- DIMENSION(:,:)[:,:]
- halo region logically extends to neighbor processors
- exchange_halo(Array)
LOPe programming model is easily expressed in Fortran because of syntax for concurrency
- Concurrent attribute added to procedures
- restricted semantics for array element access to avoid race conditions
- copy halo in, write single element out (visible after all threads exit)
- Called from within a DO CONCURRENT loop
Transformation (via ROSE) of a LOPe program to OpenCL allows execution on a GPU
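For readers unfamiliar with halo (ghost‐cell) regions, the C++ sketch below illustrates the idea the HALO attribute and exchange_halo capture: an interior owned locally plus a one‐cell border refreshed before each update, where each concurrent iteration reads its halo'd neighborhood but writes only its own element. This is an illustration only, assuming a single process with periodic wrap‐around in place of coarray communication; it is neither LOPe's Fortran syntax nor the OpenCL code that ROSE generates.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a 2D field with a one-cell halo, mirroring what
// HALO(1:*:1, 1:*:1) declares. Real LOPe halos live on coarray images;
// here the exchange is faked with periodic wrap-around on one process.
struct Field {
    std::size_t nx, ny;                 // interior extents
    std::vector<double> a;              // (nx+2) x (ny+2) cells including halo
    Field(std::size_t nx_, std::size_t ny_)
        : nx(nx_), ny(ny_), a((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& at(std::size_t i, std::size_t j) { return a[i * (ny + 2) + j]; }
};

// Stand-in for LOPe's exchange_halo(Array): refresh the border cells.
void exchange_halo(Field& f) {
    for (std::size_t j = 0; j <= f.ny + 1; ++j) {
        f.at(0, j)        = f.at(f.nx, j);   // top halo <- last interior row
        f.at(f.nx + 1, j) = f.at(1, j);      // bottom halo <- first interior row
    }
    for (std::size_t i = 0; i <= f.nx + 1; ++i) {
        f.at(i, 0)        = f.at(i, f.ny);   // left halo <- last interior column
        f.at(i, f.ny + 1) = f.at(i, 1);      // right halo <- first interior column
    }
}

// Analogue of a CONCURRENT LOPe procedure called from DO CONCURRENT:
// each (i, j) reads the halo'd neighborhood and writes only its own element.
void relax(Field& in, Field& out) {
    exchange_halo(in);
    for (std::size_t i = 1; i <= in.nx; ++i)
        for (std::size_t j = 1; j <= in.ny; ++j)
            out.at(i, j) = 0.25 * (in.at(i - 1, j) + in.at(i + 1, j) +
                                   in.at(i, j - 1) + in.at(i, j + 1));
}
```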
Compiler Research is essential for DSLs (C3)
The DSL compiler captures expert knowledge about how to map high‐level abstractions onto different architectures, and is complemented by the general compiler analyses and transformations shared with general‐purpose language compilers. Architecture‐specific features are reasoned about through machine learning and/or a parameterized abstract machine model that can be tailored to different machines.
- We will leverage existing technologies:
- Source‐to‐source technology in ROSE (LLNL and Rice)
- X10 front‐end for connection to ROSE (IBM)
- LLVM as low level IR in ROSE (LLNL and Rice)
- Polyhedral analysis to support optimizations (OSU)
- Machine learning to drive optimizations (MIT)
- Correctness checking (MIT and UCB)
- We will develop new technologies:
- Rosebud DSL specification
- DSL specific analysis and optimizations
- Automated DSL compiler generation
- X10 support in ROSE
- Define mappings between DSL layers to compiler analysis
- Refinement using equivalence checking
- Verification for transformations
- We will advance the state‐of‐the‐art:
- Formal methods use for HPC
- Generation of DSLs for productivity and performance portability
- Extending/Using polyhedral analysis to drive code generation for heterogeneous architectures
- Exascale challenges:
- Scalability: code generation for X10/SEEC and Scalable Data Structures, program synthesis
- Programmability: two approaches to DSL construction, automated equivalence checking
- Performance Portability: Using parameterized abstract machines, machine learning, auto‐tuning of refinement search spaces
- Resilience: Compiler‐based software TMR
- Energy Efficiency: using machine learning
- Interoperability and Migration Plans:
- Interoperability: A single compiler IR supports reusing analysis and transformations
- Migration Plan: Using source‐to‐source technology permits leveraging the vendor’s compiler
Preliminary Experimental Results (Habanero Hierarchical Place Tree)
- Actual hardware: four quad-core Xeon sockets; each socket contains two core-pairs; each core-pair shares an L2 cache
- Possible abstract machine models:
- Use Habanero Hierarchical Place Tree (HPT) abstraction for these results
- Experiment with three HPT abstractions of same hardware:
- 1x16 --- one root place with 16 leaf places (this model focuses on L1 cache locality)
- 8x2 --- 8 non-leaf places, each of which has 2 leaf places (this model focuses on the L2 cache shared by a core-pair)
- 16x1 --- like 1x16, except that it ignores the root place
- Preliminary execution times for SOR2D (size C) on above hardware underscore the importance of selecting the right abstraction for a given application-platform combination
- 1x16 --- 1.14 seconds
- 8x2 --- 0.61 seconds
- 16x1 --- 1.90 seconds
X10
Current X10 Runtime Software Stack
- Core Class Libraries
- Fundamental classes & primitives, Arrays, core I/O, collections, etc
- Written in X10; compiled to C++ or Java
- XRX (X10 Runtime in X10)
- APGAS functionality
- Concurrency: async/finish
- Distribution: Places/at
- Written in X10; compiled to C++ or Java
- X10 Language Native Runtime
- Runtime support for core sequential X10 language features
- Two versions: C++ and Java
- X10RT
- Active messages, collectives, bulk data transfer
- Implemented in C
- Abstracts/unifies network layers (PAMI, DCMF, MPI, etc) to enable X10 on a range of transports
Leveraging X10 Runtime for Native Applications
Native APGAS API
- Provides C++/C APIs to APGAS functionality of X10 Runtime
- Concurrency: async/finish
- Distribution: Places/at
- Additionally exposes subset of X10RT APIs for use by native applications
- Collective operations
- One‐sided active messages
- Allows non‐X10 applications to leverage X10 runtime facilities via a library interface
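A toy, single‐node illustration of the async/finish idiom such a library interface exposes is sketched below. The toy_apgas names are hypothetical placeholders, not the actual Native APGAS API; only the async/finish concept comes from the description above.

```cpp
#include <future>
#include <iostream>
#include <vector>

// Toy, single-node illustration of async/finish. The names in toy_apgas are
// hypothetical; this is NOT the actual Native APGAS API, just the model.
namespace toy_apgas {
struct finish {
    std::vector<std::future<void>> tasks;
    template <class F>
    void async(F&& f) {                       // spawn a child task
        tasks.push_back(std::async(std::launch::async, std::forward<F>(f)));
    }
    ~finish() {                               // block until all children complete
        for (auto& t : tasks) t.get();
    }
};
}  // namespace toy_apgas

int main() {
    std::vector<int> partial(4, 0);
    {
        toy_apgas::finish f;                  // "finish { ... }" scope
        for (int p = 0; p < 4; ++p)
            f.async([&partial, p] {           // "async" child tasks
                partial[p] = p * p;           // each task writes its own slot
            });
    }                                         // implicit join here
    int sum = 0;
    for (int v : partial) sum += v;
    std::cout << "sum = " << sum << "\n";     // prints 14
    return 0;
}
```

Distribution (Places/at), collectives, and one‐sided active messages add a second dimension (where code runs and where data lives) on top of this intra‐node fork/join structure.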
Scalability of X10 Runtime
- Scalability
- X10 programs have achieved good scaling at > 32k cores on P7IH (PERCS) and up to 8k cores on BlueGene/P
- Support for Intra‐node scalability
- async/finish enable high‐level programming of fine‐grained concurrency
- Advanced features (clocks, collecting finish) support determinate programming of common concurrency idioms
- Workstealing implementation: both Fork/Join & Cilk‐style
- APGAS programming model extended to GPUs
- X10 kernels can be compiled to CUDA
- compiler‐mediated data/control transfer between CPU/GPU
- Support for Inter‐node scalability
- Places/at; collectives; one‐sided active messages; asynchronous bulk data transfer APIs
- Utilizes available transports (PAMI, DCMF, MPI)
SEEC Runtime
- Understands high‐level goals
- E.g., performance, accuracy, power
- Makes observations
- Is app on current machine meeting goals?
- Understands actions
- Provided by optimization management, the machine, and uncertainty quantification
- Makes decisions about how to take action given goals and current observations
- Uses control theory, machine learning, and possibly game theory
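A minimal sketch of this observe/decide/act loop is given below, with hypothetical Goal, Observation, and Action types; the real SEEC decision engine is adaptive (control theory, machine learning) rather than the simple rule used here.

```cpp
#include <functional>
#include <vector>

// Hypothetical types standing in for SEEC concepts; illustrative only.
struct Goal        { double targetThroughput; double powerBudgetWatts; };
struct Observation { double throughput;       double powerWatts;       };
struct Action      { std::function<void()> apply;
                     double expectedSpeedup; double expectedPowerDelta; };

// One iteration of observe -> decide -> act. The real SEEC runtime learns
// which action to take; here we just pick the first action whose predicted
// effect moves toward the goal without exceeding the power budget.
void seec_step(const Goal& goal, const Observation& obs,
               const std::vector<Action>& actions) {
    if (obs.throughput >= goal.targetThroughput &&
        obs.powerWatts  <= goal.powerBudgetWatts)
        return;                                      // goals met: do nothing
    for (const Action& a : actions) {
        bool helpsPerf     = obs.throughput < goal.targetThroughput &&
                             a.expectedSpeedup > 1.0;
        bool staysInBudget = obs.powerWatts + a.expectedPowerDelta <=
                             goal.powerBudgetWatts;
        if (helpsPerf && staysInBudget) { a.apply(); return; }
    }
}
```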
Polyhedral Compiler Transformations
- Advantages over standard AST‐based compiler frameworks
- Seamless support of imperfectly nested loops
- Handle symbolic loop bounds
- Powerful uniform model for composition of transformations
- Model‐driven optimization using the power of integer linear programming
- Work planned on D‐TEC project
- Leverage/integrate DSL properties in the optimization process
- Expose API for analysis and semantics‐preserving transformations of programs
- Multi‐target code generation using domain semantics and architecture characteristics
- Communication optimization using high‐level semantic information
- Address challenges in applying polyhedral transformations to complex DOE application codes
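As a hand‐written illustration of the kind of transformation a polyhedral framework derives automatically (together with its legality analysis), the following shows a stencil loop nest with symbolic bounds before and after rectangular tiling:

```cpp
#include <algorithm>

// Original loop nest: a 2D stencil sweep with symbolic bounds n, m.
void sweep(double* A, const double* B, long n, long m) {
    for (long i = 1; i < n - 1; ++i)
        for (long j = 1; j < m - 1; ++j)
            A[i * m + j] = 0.25 * (B[(i - 1) * m + j] + B[(i + 1) * m + j] +
                                   B[i * m + j - 1]   + B[i * m + j + 1]);
}

// The same computation after rectangular tiling (tile size T): a schedule a
// polyhedral optimizer would produce to improve cache locality. This version
// is hand-written for illustration; legality is easy to see here because each
// iteration (i, j) writes only A[i][j].
void sweep_tiled(double* A, const double* B, long n, long m, long T) {
    for (long ii = 1; ii < n - 1; ii += T)
        for (long jj = 1; jj < m - 1; jj += T)
            for (long i = ii; i < std::min(ii + T, n - 1); ++i)
                for (long j = jj; j < std::min(jj + T, m - 1); ++j)
                    A[i * m + j] = 0.25 * (B[(i - 1) * m + j] + B[(i + 1) * m + j] +
                                           B[i * m + j - 1]   + B[i * m + j + 1]);
}
```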
Multi‐target Domain‐specialized Code Generation
MIT Sketch
How MIT Sketch Works
- Synthesis engine works by elimination
- Partial implementation defines space of possible solutions
- Classes of incorrect solutions are eliminated by analyzing why particular incorrect solutions failed
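A toy version of this elimination loop can be written directly: propose a candidate from the space defined by the partial implementation, verify it, and turn any failure into a counterexample that rules out further candidates without re‐testing them exhaustively. The example below synthesizes a coefficient c with f(x) = c·x against a fixed specification; it is only a sketch of the idea, and the real Sketch engine reasons symbolically over much richer candidate spaces.

```cpp
#include <iostream>
#include <optional>
#include <vector>

// Toy illustration of synthesis by elimination; not the Sketch engine.
// Specification the synthesized function must match (here: triple its input).
int spec(int x) { return 3 * x; }

// Candidate space defined by a "hole": f(x) = c * x for some integer c.
// Counterexample-guided elimination: keep a set of witness inputs; any
// candidate that disagrees with spec() on a witness is discarded cheaply.
std::optional<int> synthesize(int maxCoeff) {
    std::vector<int> witnesses = {1};                 // start with one test input
    for (int c = 0; c <= maxCoeff; ++c) {
        bool ok = true;
        for (int w : witnesses)
            if (c * w != spec(w)) { ok = false; break; }
        if (!ok) continue;                            // eliminated by a known counterexample
        // Verify on a broader range; any failure becomes a new counterexample.
        for (int x = -10; x <= 10; ++x)
            if (c * x != spec(x)) { witnesses.push_back(x); ok = false; break; }
        if (ok) return c;                             // survives verification
    }
    return std::nullopt;
}

int main() {
    if (auto c = synthesize(10)) std::cout << "f(x) = " << *c << " * x\n";
}
```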
Sketching Enhanced Refinement: Low‐Level
- Synthesis simplifies manual refinement
- Sophisticated implementation is simple if we can elide low‐level details
- Automated equivalence checking helps avoid bugs in the refinement process
Sketching Enhanced Refinement: High‐Level
The role of constraints on types
- Constraints can appear on classes or functions
- Constraints allow locality of reasoning and simplify synthesis
Fault Tolerance
- User defines bound on expected error
- UQ to determine how faults contribute to total error
- Represent total error as a function of errors caused by transient faults (in individual tasks)
- Total error is a function of errors introduced in each faulty task
- Errors due to faults modeled as random noise
- Each random quantity ε_i captures transient fault influence on tasks
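One schematic way to write this model (the form of f, the noise assumption, and the bound parameters τ and δ are illustrative, not prescribed above):

```latex
% Schematic error model: total error as a function of per-task, fault-induced errors
E_{\mathrm{total}} \;=\; f(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n),
\qquad \varepsilon_i \ \text{random noise capturing transient faults in task } i

% User-supplied requirement on the expected error (tau and delta are illustrative)
\mathbb{P}\bigl(\,\lvert E_{\mathrm{total}} \rvert > \tau \,\bigr) \;\le\; \delta
```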
Tools for Legacy Code Modernization
- Incrementally add DSL constructs to legacy codes
- Replace performance‐critical sections by DSLs
- Our “mixed‐DSLs + host language” architecture supports this
- Manual addition of DSL constructs is low risk
- Semi‐automatic addition of DSL constructs is promising
- Recognize opportunities for DSL constructs using same pattern‐matching as in rewriting system
- Human could direct, assist, verify, or veto
- Fully automatic rewriting of fragments to DSL constructs may be possible
- Benefits
- Higher performance using aggressive DSL optimization
- Performance portability without a complete rewrite
Tools for Understanding DSL Performance
- Challenges
- Huge semantic gap between embedded DSL and generated code
- Code generation for DSLs is opaque, debugging is hard, and fine‐grain performance attribution is unavailable
- Goal: Bridge semantic gap for debugging and performance tuning
- Approach
- Record information during program compilation
- two‐way mappings between every token in source and generated code
- transformation options, domain knowledge, cost models, and choices
- Monitor and attribute execution characteristics with instrumentation and sampling
- e.g., parallelism, resource consumption, contention, failure, scalability
- Map performance back to source, transformations, and domain knowledge
- Compensate for approximate cost models with empirical autotuning
- Technologies to be developed
- Strategies for maintaining mappings without overly burdening DSL implementers
- Strategies for tracking transformations, knowledge, and costs through compilation
- Techniques for exploring and explaining the roles of transformations and knowledge
- Algorithms for refining cost estimates with observed costs to support autotuning
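The two‐way mappings recorded during compilation (see "Approach" above) could be kept as a table of records connecting each generated‐code location back to the originating DSL token, the transformations applied along the way, and the domain knowledge that drove them. A minimal sketch with hypothetical field names:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record types; the real tool infrastructure would be richer.
struct SourceLoc { std::string file; int line; int column; };
struct MapEntry {
    SourceLoc dslToken;                    // where the construct appears in the DSL source
    SourceLoc generatedCode;               // where it landed in the generated code
    std::vector<std::string> transforms;   // transformation choices applied in between
    std::string domainKnowledge;           // domain fact or cost-model note that drove them
};

// Forward map (generated-code line -> entries) lets a profiler attribute a
// sample taken in generated code back to DSL source, transformations, and
// domain knowledge; the reverse map supports source-level debugging.
using PerfMap = std::multimap<int /*generated line*/, MapEntry>;
```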
Migrating Existing Codes
Benefits of custom, source‐to‐source translation
- Automatically restructure conventional code using a custom source‐to‐source translator that captures semantic knowledge of the application domain, thereby improving performance
- Embedded Domain Specific Languages
- Automatically tolerate communication delays
- Squeeze out library overheads
- Library primitives → primitive language objects
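The "library primitives → primitive language objects" step might look like the following before/after pair; Grid, halo_exchange, and dsl_grid are hypothetical examples rather than constructs from a particular application or DSL:

```cpp
#include <cstddef>

// BEFORE: conventional code calls an opaque library primitive; the compiler
// sees only a function call, so it cannot overlap, fuse, or specialize it.
// (Grid and halo_exchange are hypothetical library entities.)
struct Grid { double* data; std::size_t nx, ny; };
void halo_exchange(Grid&) { /* communication calls elided */ }

void step_before(Grid& g) {
    halo_exchange(g);                 // library overhead; no overlap possible
    // ... stencil update over g.data ...
}

// AFTER (conceptually): a source-to-source translator rewrites the code to a
// DSL-level construct, mocked up here as a type whose operations the DSL
// compiler understands, so it can squeeze out overheads and tolerate
// communication delays by scheduling the exchange around the update.
struct dsl_grid {                     // hypothetical DSL "primitive language object"
    double* data; std::size_t nx, ny;
    void exchange_halo() { /* generated, optimizable communication elided */ }
};

void step_after(dsl_grid& g) {
    g.exchange_halo();                // first-class DSL operation
    // ... stencil update expressed in DSL constructs ...
}
```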