Actions

X-TUNE: Difference between revisions

From Modelado Foundation

imported>Cdenny
(Created page with "{{Infobox project | title = X-TUNE | image = 180px | imagecaption = | team-members = List of team members | pi = Lead PI (Institute) | co-pi = Co-...")
 
imported>Cdenny
No edit summary
Line 1: Line 1:
{{Infobox project
{{Infobox project
| title = X-TUNE
| title = X-TUNE
| image = [[File:Your-team-logo.png|180px]]
| image = [[File:XTUNE-Logos.png|300px]]
| imagecaption =  
| imagecaption =  
| team-members = List of team members
| team-members = [http:///www.utah.edu/ U. of Utah], [http://www.anl.gov/ ANL], [http://www.lbl.gov/ LBNL], [http://www.isi.edu/ USC/ISI]
| pi = Lead PI (Institute)
| pi = Mary Hall (U. of Utah)
| co-pi = Co-PIs (Institute)
| co-pi = Paul Hovland (ANL), Stefan Wild (ANL), Krishna Narayanan (ANL), Jeff Hammond (ANL), Lenny Oliker (LBNL), Sam Williams (LBNL), Brian van Straalen (LBNL), Jacqueline Chame (USC/ISI)
| website = team website
| website = [http://ctop.cs.utah.edu/x-tune/ http://ctop.cs.utah.edu/x-tune/]
}}
}}


''Description about your project goes here.....''
'''Autotuning for Exascale''' or '''X-TUNE''' - ''Self-Tuning Software to manage Heterogeneity''
 


== Team Members ==
== Team Members ==
* [http:///www.utah.edu/ University of Utah]
* [http://www.anl.gov/ Argonne National Laboratory (ANL)]
* [http://www.lbl.gov/ Lawrence Berkeley National Laboratory (LBNL)]
* [http://www.isi.edu/ University of Southern California: Information Science Institute (USC/ISI)]
== What is Autotuning? ==
'''Definition:'''
* Automatically generate a “search space” of possible implementations of a computation
** A ''code variant'' represents a unique implementation of a computation, among many
** A ''parameter'' represents a discrete set of values that govern code generation or execution of a variant
* Measure execution time and compare
* Select the best-performing implementation (for exascale, tradeoff between performance/energy/reliability)
'''Key Issues:'''
* Identifying the search space
* Pruning the search space to manage costs
* Off-line vs. on-line search
'''Three Types of Autotuning Systems'''
* Autotuning libraries
** Library that encapsulates knowledge of its performance under different execution environments
** Dense linear algebra: ''ATLAS, PhiPAC''
** Sparse linear algebra: ''OSKI''
** Signal processing: ''SPIRAL, FFTW''
* Application-specific autotuning
** ''Active Harmony'' provides parallel rank order search for tunable parameters and variants
** ''Sequoia'' and ''PetaBricks'' provide language mechanism for expressing tunable parameters and variants
* Compiler-based autotuning
** Other examples: Saday et al., Swany et al., Eignenmann et al.
** Related concepts: iterative compilation, learning-based compilation
== X-TUNE Goals ==
A unified autotuning framework that seamlessly integrates programmer-directed and compiler-directed autotuning
* Expert programmer and compiler work collaboratively to tune a code
** Unlike previous systems that place the burden on either programmer or compiler
** Provides access to compiler optimizations, offering expert programmers the control over optimization they so often desire
* Design autotuning to be encapsulated in domain-specific tools
** Enables less-sophisticated users of the software to reap the benefit of the expert programmers’ efforts
* Focus on Adaptive Mesh Refinement Multigrid (Combustion Co-Design Center, BoxLib, Chombo) and tensor contractions (TCE)
== X-TUNE Structure ==
[[File:XTUNE-Structure.png|600px]]
== Autotuning Language Extensions ==
* Tunable Variables
** An annotation on the type of a variable (as in Sequoia)
** Additionally, specify range, constraints and a default value
* Computation Variants
** An annotation on the type of a function (as in PetaBricks)
** Additionally, specify (partial) selection criteria
** Multiple variants may be composed in the same execution
''Separate mapping description captures architecture-specific aspects of autotuning.''
== Compiler-Based Autotuning ==
* Foundational Concepts
** Identify search space through a high-level description that captures a large space of possible implementations
** Prune space through compiler domain knowledge and architecture features
** Provide access to programmers with ''transformation recipes'', or recipes generated automatically by compiler decision algorithm
** Uses source-to-source transformation for portability, and to leverage vendor code generation
** Requires ''restructuring of the compiler''
== CHiLL Implementation ==
[[File:XTUNE-CHiLL.png|600px]]
== Transformation Recipes for Autotuning ==
'''Incorporate the Best Ideas from Manual Tuning'''
[[File:XTUNE-Transformation-Recipes.png|600px]]
'''Compiler + Autotuning can yield comparable and even better performance than manually-tuned libraries'''
[[File:XTUNE-Results.png|600px]]
== Pbound: Performance Modeling for Autotuning ==
[[File:XTUNE-Pbound.png|right|200px]]
* Performance modeling increases the automation in autotuning
** Manual transformation recipe generation is tedious and error-prone
** Implicit models are not portable across platforms
* Models can unify programmer guidance and compiler analysis
** Programmer can invoke integrated models to guide autotuning from application code
** Compiler can invoke models during decision algorithms
* Models optimize autotuning search
** Identify starting points
** Prune search space to focus on most promising solutions
** Provide feedback from updates in response to code modifications
=== Reuse Distance and Cache Miss Prediction ===
'''Reuse distance'''
* For regular (affine) array references
** Compute reuse distance, to predict data footprints in memory hierarchy
** Guides transformation and data placement decisions
[[File:XTUNE-Pbound-Reuse-Distance.png]]
'''Cache miss prediction'''
* Use to predict misses
* Assuming fully associative cache with n lines (optimistic case), a reference will hit if the reuse distance d<n
[[File:XTUNE-Pbound-Cache-Miss.png]]
=== Application Signatures + Architecture ===
[[File:XTUNE-Pbound-Example.png|600px]]
== How will modeling be used? ==
* Single-core and multicore models for application performance will combine architectural information, user-guidance, and application analysis
* Models will be coupled with decision algorithms to automatically generate CHiLL transformation recipes
** Input: Reuse Information, Loop Information etc.
** Output: Set of transformation scripts to be used by empirical search
* Feedback to be used to refine model parameters and behavior
* Small and large application execution times will be considered
'''Example: Stencils and Multigrid'''
* Stencil performance bound, when bandwidth limited:
  '''''Performance (gflops) <= stencil flops * STREAM bandwidth / grid size'''''
* Multigrid solves Au=f by calculating a number of corrections to an initial solution at varying grid coarsenings (“V-cycle”)
** Each level in the v-cycle: perform 1-4 relaxes (~stencil sweeps)
** Repeat multiple v-cycles reducing the norm of the residual by an order of magnitude each cycle
[[File:XTUNE-Example.png|600px]]




== Objectives ==
'''Multigrid and Adaptive Mesh Refinement'''


[[File:XTUNE-Multigrid.png|right|200px]]
* Some regions of the domain may require finer fidelity than others
* In Adaptive Mesh Refinement, we refine those regions to a higher resolution in time and space
* Typically, one performs a multigrid “level solve” for one level (green, blue, red) at a time
* Coarse-fine boundaries (neighboring points can be at different resolutions) complicate the calculation of the RHS and ghost zones for the level
* Each level is a collection of small (32<sup>3</sup> or 64<sup>3</sup>) boxes to minimize unnecessary work
* These boxes will be distributed across the machine for load balancing (neighbors are not obvious/implicit)


== Roadmap ==


== Autotuning for AMR Multigrid ==
* Focus is addressing data movement, multifaceted:
** Automate fusion of stencils within an operator. Doing so may entail aggregation of communication (deeper ghost zones)
** Extend and automate the communication-avoiding techniques developed in CACHE
** Automate application of data movement-friendly coarse-fine boundary conditions
** Automate hierarchical parallelism within a node to AMR MG codes
** Explore alternate data structures
** Explore alternate stencil algorithms (higher order, …)
* Proxy architectures: MIC, BG/Q, GPUs
* Encapsulate into an embedded DSL approach


== Impact ==


== Summary and Leverage ==
* Build integrated end-to-end autotuning, focused on AMR multigrid and tensor contractions
** Language and compiler guidance of autotuning
** Programmer and compiler collaborate to tune a code
** Modeling assists programmer, compiler writer, and search space pruning


== Software Stack ==
* Leverage and integrate with other X-Stack teams
** Our compiler technology all based on ROSE so can leverage from and provide capability to ROSE
** Domain-specific technology to facilitate encapsulating our autotuning strategies
** Collaborate with MIT on autotuning interface
** Common run-time for a variety of platforms (e.g., GPUs and MIC), and supports a large number of potentially hierarchical threads

Revision as of 21:29, February 11, 2013

X-TUNE
File:XTUNE-Logos.png
Team Members U. of Utah, ANL, LBNL, USC/ISI
PI Mary Hall (U. of Utah)
Co-PIs Paul Hovland (ANL), Stefan Wild (ANL), Krishna Narayanan (ANL), Jeff Hammond (ANL), Lenny Oliker (LBNL), Sam Williams (LBNL), Brian van Straalen (LBNL), Jacqueline Chame (USC/ISI)
Website http://ctop.cs.utah.edu/x-tune/
Download {{{download}}}

Autotuning for Exascale or X-TUNE - Self-Tuning Software to manage Heterogeneity


Team Members


What is Autotuning?

Definition:

  • Automatically generate a “search space” of possible implementations of a computation
    • A code variant represents a unique implementation of a computation, among many
    • A parameter represents a discrete set of values that govern code generation or execution of a variant
  • Measure execution time and compare
  • Select the best-performing implementation (for exascale, tradeoff between performance/energy/reliability)

Key Issues:

  • Identifying the search space
  • Pruning the search space to manage costs
  • Off-line vs. on-line search

Three Types of Autotuning Systems

  • Autotuning libraries
    • Library that encapsulates knowledge of its performance under different execution environments
    • Dense linear algebra: ATLAS, PhiPAC
    • Sparse linear algebra: OSKI
    • Signal processing: SPIRAL, FFTW
  • Application-specific autotuning
    • Active Harmony provides parallel rank order search for tunable parameters and variants
    • Sequoia and PetaBricks provide language mechanism for expressing tunable parameters and variants
  • Compiler-based autotuning
    • Other examples: Saday et al., Swany et al., Eignenmann et al.
    • Related concepts: iterative compilation, learning-based compilation


X-TUNE Goals

A unified autotuning framework that seamlessly integrates programmer-directed and compiler-directed autotuning

  • Expert programmer and compiler work collaboratively to tune a code
    • Unlike previous systems that place the burden on either programmer or compiler
    • Provides access to compiler optimizations, offering expert programmers the control over optimization they so often desire
  • Design autotuning to be encapsulated in domain-specific tools
    • Enables less-sophisticated users of the software to reap the benefit of the expert programmers’ efforts
  • Focus on Adaptive Mesh Refinement Multigrid (Combustion Co-Design Center, BoxLib, Chombo) and tensor contractions (TCE)


X-TUNE Structure

XTUNE-Structure.png


Autotuning Language Extensions

  • Tunable Variables
    • An annotation on the type of a variable (as in Sequoia)
    • Additionally, specify range, constraints and a default value
  • Computation Variants
    • An annotation on the type of a function (as in PetaBricks)
    • Additionally, specify (partial) selection criteria
    • Multiple variants may be composed in the same execution

Separate mapping description captures architecture-specific aspects of autotuning.


Compiler-Based Autotuning

  • Foundational Concepts
    • Identify search space through a high-level description that captures a large space of possible implementations
    • Prune space through compiler domain knowledge and architecture features
    • Provide access to programmers with transformation recipes, or recipes generated automatically by compiler decision algorithm
    • Uses source-to-source transformation for portability, and to leverage vendor code generation
    • Requires restructuring of the compiler


CHiLL Implementation

XTUNE-CHiLL.png


Transformation Recipes for Autotuning

Incorporate the Best Ideas from Manual Tuning

XTUNE-Transformation-Recipes.png


Compiler + Autotuning can yield comparable and even better performance than manually-tuned libraries

XTUNE-Results.png


Pbound: Performance Modeling for Autotuning

XTUNE-Pbound.png
  • Performance modeling increases the automation in autotuning
    • Manual transformation recipe generation is tedious and error-prone
    • Implicit models are not portable across platforms
  • Models can unify programmer guidance and compiler analysis
    • Programmer can invoke integrated models to guide autotuning from application code
    • Compiler can invoke models during decision algorithms
  • Models optimize autotuning search
    • Identify starting points
    • Prune search space to focus on most promising solutions
    • Provide feedback from updates in response to code modifications

Reuse Distance and Cache Miss Prediction

Reuse distance

  • For regular (affine) array references
    • Compute reuse distance, to predict data footprints in memory hierarchy
    • Guides transformation and data placement decisions

XTUNE-Pbound-Reuse-Distance.png

Cache miss prediction

  • Use to predict misses
  • Assuming fully associative cache with n lines (optimistic case), a reference will hit if the reuse distance d<n

XTUNE-Pbound-Cache-Miss.png

Application Signatures + Architecture

XTUNE-Pbound-Example.png

How will modeling be used?

  • Single-core and multicore models for application performance will combine architectural information, user-guidance, and application analysis
  • Models will be coupled with decision algorithms to automatically generate CHiLL transformation recipes
    • Input: Reuse Information, Loop Information etc.
    • Output: Set of transformation scripts to be used by empirical search
  • Feedback to be used to refine model parameters and behavior
  • Small and large application execution times will be considered


Example: Stencils and Multigrid

  • Stencil performance bound, when bandwidth limited:
 Performance (gflops) <= stencil flops * STREAM bandwidth / grid size
  • Multigrid solves Au=f by calculating a number of corrections to an initial solution at varying grid coarsenings (“V-cycle”)
    • Each level in the v-cycle: perform 1-4 relaxes (~stencil sweeps)
    • Repeat multiple v-cycles reducing the norm of the residual by an order of magnitude each cycle


XTUNE-Example.png


Multigrid and Adaptive Mesh Refinement

  • Some regions of the domain may require finer fidelity than others
  • In Adaptive Mesh Refinement, we refine those regions to a higher resolution in time and space
  • Typically, one performs a multigrid “level solve” for one level (green, blue, red) at a time
  • Coarse-fine boundaries (neighboring points can be at different resolutions) complicate the calculation of the RHS and ghost zones for the level
  • Each level is a collection of small (323 or 643) boxes to minimize unnecessary work
  • These boxes will be distributed across the machine for load balancing (neighbors are not obvious/implicit)


Autotuning for AMR Multigrid

  • Focus is addressing data movement, multifaceted:
    • Automate fusion of stencils within an operator. Doing so may entail aggregation of communication (deeper ghost zones)
    • Extend and automate the communication-avoiding techniques developed in CACHE
    • Automate application of data movement-friendly coarse-fine boundary conditions
    • Automate hierarchical parallelism within a node to AMR MG codes
    • Explore alternate data structures
    • Explore alternate stencil algorithms (higher order, …)
  • Proxy architectures: MIC, BG/Q, GPUs
  • Encapsulate into an embedded DSL approach


Summary and Leverage

  • Build integrated end-to-end autotuning, focused on AMR multigrid and tensor contractions
    • Language and compiler guidance of autotuning
    • Programmer and compiler collaborate to tune a code
    • Modeling assists programmer, compiler writer, and search space pruning
  • Leverage and integrate with other X-Stack teams
    • Our compiler technology all based on ROSE so can leverage from and provide capability to ROSE
    • Domain-specific technology to facilitate encapsulating our autotuning strategies
    • Collaborate with MIT on autotuning interface
    • Common run-time for a variety of platforms (e.g., GPUs and MIC), and supports a large number of potentially hierarchical threads