{{Infobox  Co-design
|name = ExaCT
|image = [[File:ExaCTWebBanner.jpg|400px]]
|imagecaption = 
|developer = [http://www.lbl.gov/ LBNL], [http://www.sandia.gov/ SNL], [http://www.lanl.gov/ LANL], [http://www.ornl.gov/ ORNL], [https://www.llnl.gov/ LLNL], [http://www.nrel.gov/ NREL], [http://www.rutgers.edu/ Rutgers U.], [https://www.utexas.edu/ UT Austin], [http://www.gatech.edu/ Georgia Tech], [http://www.stanford.edu/ Stanford U.], [http://www.utah.edu/ U. of Utah]
|latest_release_version = version x.y.z
|latest_release_date = Latest Release Date here
|operating_system = Linux, Unix, etc.
|genre = Computational Chemistry?
|license = Open Source or else?
|website = [http://exactcodesign.org http://exactcodesign.org]
}}


'''ExaCT''' (Center for Exascale Simulation of Combustion in Turbulence)
== Introduction ==
=== Physics of Gas-Phase Combustion represented by PDEs ===
[[File:ExaCT-Gas-Phase-Combustion.png|right|250px]]
* Focus on gas phase combustion in both compressible and low-Mach limits
* Fluid mechanics
** Conservation of mass
** Conservation of momentum
** Conservation of energy


* Thermodynamics
** Pressure, density, temperature relationships for multicomponent mixtures


* Chemistry
** Reaction kinetics


* Species transport
** Diffusive transport of different chemical species within the flame


=== Code Base ===
* S3D
** Fully compressible Navier–Stokes
** Eighth-order in space, fourth-order in time
** Fully explicit, uniform grid
** Time step limited by acoustic / chemical time scales (see the note after this list)
** Hybrid implementation with MPI + OpenMP
** Implemented for Titan at ORNL using OpenACC
 
* LMC
** Low Mach number formulation
** Projection-based discretization strategy
** Second-order in space and time
** Semi-implicit treatment of advection and diffusion
** Time step based on advection velocity
** Stiff ODE integration methodology for chemical kinetics
** Incorporates block-structured adaptive mesh refinement
** Hybrid implementation with MPI + OpenMP
 
* The target is a computational model that supports compressible and low Mach number AMR simulation with integrated UQ
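
The difference in time-step restrictions noted above is the key practical distinction between the two code bases. As a rough sketch in standard CFL notation (not taken from the S3D or LMC sources): the fully compressible formulation must resolve acoustic waves, while the low Mach number formulation is constrained only by the advection velocity,

:<math>\Delta t_{\text{compressible}} \lesssim \sigma \, \frac{\Delta x}{|u| + c}, \qquad \Delta t_{\text{low Mach}} \lesssim \sigma \, \frac{\Delta x}{|u|},</math>

where ''σ'' is a CFL safety factor, ''Δx'' the grid spacing, ''u'' the local velocity, and ''c'' the sound speed. For low Mach number flames (|''u''| ≪ ''c'') the second bound permits time steps larger by roughly a factor of 1/''M'', with ''M'' = |''u''|/''c'' the Mach number, which is what the projection-based LMC formulation buys.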
 
=== Adaptive Mesh Refinement ===
[[File:ExaCT-AMR.png|right|300px]]
* Need for AMR
** Reduce memory
** Scaling analysis – for explicit schemes, flops scale as memory<sup>4/3</sup> (see the note after this list)
 
* Block-structured AMR
** Data organized into logically-rectangular structured grids
** Amortize irregular work
** Good match for multicore architectures
 
* AMR introduces extra algorithm issues not found in static codes
** Metadata manipulation
** Regridding operations
** Communications patterns
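
The memory<sup>4/3</sup> scaling cited above follows from a simple counting argument, sketched here under standard assumptions (a 3-D grid with ''N'' points per dimension and an explicit scheme whose stable time step shrinks in proportion to the mesh spacing):

:<math>\text{memory} \propto N^3, \qquad \Delta t \propto \Delta x \propto N^{-1}, \qquad \text{flops} \propto N^3 \times N = N^4 = \left(N^3\right)^{4/3} \propto \text{memory}^{4/3},</math>

since the work per step scales as ''N''<sup>3</sup> and the number of steps to reach a fixed physical time scales as ''N''. Uniform refinement therefore grows the flop count faster than the memory footprint, which is the argument for refining only where the flame structure requires it.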
 
=== Preliminary Observations ===
* Need to rethink how we approach PDE discretization methods for multiphysics applications
** Exploit relationship between scales
** More concurrency
** More locality with reduced synchronization
** Less memory / FLOP
** Analysis of algorithms has typically been based on a performance = FLOPS paradigm – can we analyze algorithms in terms of a more realistic performance model that accounts for data movement (see the sketch after this list)?


* Need to integrate analysis with simulation
** Combustion simulations are data rich
** Writing data to disk for subsequent analysis is already close to infeasible
** Makes simulation look much more like physical experiments in terms of methodology


* Current programming models are inadequate for the task
** We describe algorithms serially and then add constructs to express parallelism at different levels of the algorithm
** We express codes in terms of FLOPS and let the compiler figure out the data movement
** Non-uniform memory access is already an issue but programmers can’t easily control data layout


* Need to evaluate tradeoffs in terms of potential architectural features
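
One candidate for the "more realistic performance model" asked for above is a roofline-style bound, in which attainable performance is capped either by the peak flop rate or by memory bandwidth times arithmetic intensity. The sketch below is illustrative only: the machine parameters and kernel intensities are placeholders, not measurements of S3D, LMC, or any ExaCT target system.

<syntaxhighlight lang="python">
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: performance is limited either by the floating-point
    peak or by memory bandwidth times arithmetic intensity (flops/byte)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# Illustrative node parameters (placeholders, not a real machine):
PEAK_GFLOPS = 500.0     # peak floating-point rate, GF/s
BANDWIDTH_GBS = 50.0    # sustained memory bandwidth, GB/s

# Assumed arithmetic intensities (flops/byte) for two kinds of kernels:
kernels = {
    "stencil sweep (streaming)": 0.5,
    "chemistry integration (cache-resident)": 12.0,
}

for name, intensity in kernels.items():
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    regime = "bandwidth-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"{name}: at most {bound:.0f} GF/s ({regime})")
</syntaxhighlight>

Under such a model a streaming stencil sweep is limited by data movement no matter how many flops the node can issue, which is why the locality and "less memory / FLOP" bullets above matter more than raw flop counts.
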
=== How Core Numerics Will Change ===
* Core numerics
** Higher-order for low Mach number formulations
** Improved coupling methodologies for multiphysics problems
** Asynchronous treatment of physical processes


* Refactoring AMR for the exascale
** Current AMR characteristics
*** Global flat metadata
*** Load-balancing based on floating point work
*** Sequential treatment of levels of refinement
** For next generation
*** Hierarchical, distributed metadata
*** Consider communication cost as part of load balancing for more realistic estimate of work (topology aware)
*** Regridding includes cost of data motion
*** Statistical performance models
*** Alternative time-stepping algorithm – treat levels simultaneously
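
To make the contrast between the current "sequential treatment of levels of refinement" and the proposed "treat levels simultaneously" concrete, the sketch below shows the traditional subcycled advance, in which each finer level takes refine-ratio smaller steps after its parent and levels therefore form a strictly sequential chain. The <code>Level</code> class and its methods are toy stand-ins for illustration, not the API of any actual AMR framework.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Optional

@dataclass
class Level:
    """Toy stand-in for one AMR level (not a real framework object)."""
    name: str
    refine_ratio: int = 2
    finer: Optional["Level"] = None

    def step(self, dt: float) -> None:
        print(f"advance {self.name} by dt = {dt:g}")

    def sync_with(self, finer: "Level") -> None:
        print(f"sync {self.name} <- {finer.name} (average down / reflux)")

def advance_subcycled(level: Level, dt: float) -> None:
    """Traditional subcycled advance: the finer level takes refine_ratio
    smaller steps after its parent, so levels are treated one after another."""
    level.step(dt)
    if level.finer is not None:
        for _ in range(level.refine_ratio):
            advance_subcycled(level.finer, dt / level.refine_ratio)
        level.sync_with(level.finer)

# Two-level example: one coarse level with a single refined level beneath it.
advance_subcycled(Level("L0", finer=Level("L1")), dt=1.0e-6)
</syntaxhighlight>

Treating levels simultaneously replaces this recursion with per-level (or per-box) tasks that can be scheduled independently, at the cost of more complicated coarse-fine coupling and synchronization.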


=== Data Analysis ===
[[File:ExaCT-Data-Analysis.png|right|300px]]
* Current simulations produce 1.5 Tbytes of data for analysis at each time step (Checkpoint data is 3.2 Tbytes)
** Archiving data for subsequent analysis is currently at limit of what can be done
** Extrapolating to the exascale, this becomes completely infeasible


* Need to integrate analysis with simulation
** Design the analysis to be run as part of the simulation definition
*** Visualizations
*** Topological analysis
*** Lagrangian tracer particles
*** Local flame coordinates
*** Etc.


* Approach based on hybrid staging concept
** Incorporate computing to reduce data volume at different stages along the path from memory to permanent file storage
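
A back-of-the-envelope estimate shows why staged reduction along the memory-to-storage path matters. In the sketch below only the 1.5 TB per analysis step comes from the figures above; the file-system bandwidth and the per-stage reduction factors are assumptions chosen purely for illustration.

<syntaxhighlight lang="python">
# Rough I/O-time estimate: raw analysis output versus staged, reduced output.
# Only the 1.5 TB figure comes from this page; everything else is assumed.
ANALYSIS_TB_PER_STEP = 1.5      # analysis data per time step (from the text)
FILESYSTEM_TB_PER_S = 0.1       # assumed sustained parallel file-system bandwidth

def write_time_s(volume_tb, bandwidth_tb_per_s):
    return volume_tb / bandwidth_tb_per_s

raw_time = write_time_s(ANALYSIS_TB_PER_STEP, FILESYSTEM_TB_PER_S)

# Hybrid staging: reduce the data in stages on the way to permanent storage,
# e.g. on-node feature extraction followed by aggregation on staging nodes.
stage_reductions = [10.0, 5.0]  # assumed reduction factor at each stage
reduced_tb = ANALYSIS_TB_PER_STEP
for factor in stage_reductions:
    reduced_tb /= factor
staged_time = write_time_s(reduced_tb, FILESYSTEM_TB_PER_S)

print(f"raw write:    {raw_time:5.1f} s per step ({ANALYSIS_TB_PER_STEP:.1f} TB)")
print(f"staged write: {staged_time:5.1f} s per step ({reduced_tb * 1000:.0f} GB)")
</syntaxhighlight>

The absolute numbers are not meaningful; the point is that modest reductions at each stage compound, which is what the hybrid staging path is intended to exploit as data volumes grow toward the exascale.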




== Co-design Process ==
[[File:ExaCT-Co-design_Process.png|right|400px]]
* Identify key simulation element
** Algorithmic
** Software
** Hardware


* Define representative code (proxy app)


* Analytic performance model
** Algorithm variations
** Architectural features
** Identify critical parameters


* Validate performance with hardware simulators/measurements


* Document tradeoffs
** Input to vendors
** Helps define programming model requirements


* Refine and iterate


== Applications ==
=== Proxy Applications ===
* Caveat
** Proxy apps are designed to address a specific co-design issue
** Union of proxy apps is not a complete characterization of the application
** Anticipated methodology for exascale not fully captured by current full applications


* Proxies
** Compressible Navier Stokes without species
*** Basic test for stencil operations, primarily at node level
*** Coming soon – generalization to multispecies with reactions (minimalist full application)
** Multigrid algorithm – 7-point stencil (see the sketch after this list)
*** Basic test for network issues
*** Coming soon – denser stencils
** Chemical integration
*** Kernel test for local, computationally intense kernel
** Others coming soon
*** Integrated UQ kernels
*** Skeletal model of full workflow
*** Visualization / analysis proxy apps
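
For reference, the 7-point stencil named in the multigrid proxy above is the standard second-order Laplacian on a 3-D grid. The NumPy sketch below shows one such sweep over a single box; it illustrates the operation only and is not the proxy app's actual code.

<syntaxhighlight lang="python">
import numpy as np

def laplacian_7pt(u, dx):
    """One 7-point Laplacian sweep over the interior of a 3-D box: each
    point combines itself with its six axis-aligned neighbors."""
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1, 1:-1] = (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +
        u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +
        u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:] -
        6.0 * u[1:-1, 1:-1, 1:-1]
    ) / dx**2
    return lap

u = np.random.rand(64, 64, 64)          # one grid box, outermost layer as ghost cells
print(laplacian_7pt(u, dx=1.0 / 64)[32, 32, 32])
</syntaxhighlight>

Each interior point does a handful of flops while reading seven values, so the sweep itself is bandwidth-bound; in the proxy the interesting behavior is the ghost-cell exchange and coarse-grid communication wrapped around sweeps like this, which is why it serves as a basic test for network issues.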


=== Visualization/Topology/Statistics Proxy Applications ===
* Proxies are algorithms with flexibility to explore multiple execution models
** Multiple strategies for local computation algorithms
** Support for various merge/broadcast communication patterns


* Topological analysis
** Three phases (local compute/communication/feature-based statistics)
** Low/no flops, highly branching code
** Compute complexity is data dependent
** Communication load is data dependent
** Requires gather/scatter of data


* Visualization
** Two phases (local compute/image compositing)
** Moderate FLOPS
** Compute complexity is data dependent
** Communication load is data dependent
** Requires gather


* Statistics
** Two phases (local compute/aggregation)
** Compute is all FLOPs
** Communication load is constant and small
** Requires gather, optional scatter of data
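
The two-phase local compute / aggregation pattern of the statistics proxy can be sketched as a partial-moment reduction. The example below uses NumPy and mpi4py as stand-ins for the proxy's actual data structures and communication layer; it illustrates the pattern and is not a description of the ExaCT code.

<syntaxhighlight lang="python">
# Phase 1: each rank computes moments of its local data (all flops).
# Phase 2: a small, fixed-size reduction aggregates them, so the
# communication volume is constant regardless of the data.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.random.rand(1_000_000)                 # this rank's share of a field

partial = np.array([local.size, local.sum(), (local ** 2).sum()])

total = np.zeros_like(partial)
comm.Allreduce(partial, total, op=MPI.SUM)        # three numbers per rank

n, s, s2 = total
if comm.rank == 0:
    print(f"global mean = {s / n:.4f}, variance = {s2 / n - (s / n) ** 2:.4f}")
</syntaxhighlight>

Run under MPI, for example <code>mpirun -np 8 python stats_sketch.py</code> (the file name is chosen here just for illustration). The contrast with the topological-analysis proxy is the point: here the communication load is constant and small, whereas topology's gather/scatter volume depends on the data.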


== Kernel Use ==

== Description ==

== Download ==