# Reporting Jim Belak's Homework Assignment

# Traleika Glacier (X-Stack)

https://sites.google.com/site/traleikaglacierxstack/

TG Team, March 20, 2013

Assignment: What does an application developer need to know about your HW/SW approach...

# HW-SW Co-design

Applications and SW stack provide guidance for efficient system design

**Applications** 

**Execution Model** 

**Programming Sys** 

Architecture

Circuits & Design

Limitations, issues and opportunities to exploit

# Today's HW System Architecture



260 pJ/F ⇒ 260 MW/Exa
55 μB of local memory/F

Today's programming model comprehends this system architecture
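The step from 260 pJ/F to 260 MW/Exa is a unit conversion, sketched below. The function name and the `EXAFLOP` constant are illustrative and not from the slides.

```python
# Power needed to sustain a given FLOP rate at a given energy per FLOP.
def system_power_watts(energy_per_flop_pj: float, flops_per_second: float) -> float:
    # pJ -> J is a factor of 1e-12; J/FLOP * FLOP/s = W
    return energy_per_flop_pj * 1e-12 * flops_per_second

EXAFLOP = 1e18  # FLOP/s

power_mw = system_power_watts(260, EXAFLOP) / 1e6
print(f"{power_mw:.0f} MW per exaflop/s")
```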

# **Technology Challenges**

1. NTV reduces energy but exacerbates variations



Small & fast cores; random distribution; temperature dependent

2. Limited NTV for arrays (memory) due to stability issues



Disproportionate
Memory arrays can be made larger

3. On-die Interconnect energy (per mm) does not reduce as much as compute



6X compute
1.6X interconnect

4. At NTV, leakage power is substantial portion of the total power



Expect 50% leakage. Idle hardware consumes energy.

5. DRAM energy scales, but not enough



50 pJ/b today; 8 pJ/b demonstrated; need < 2 pJ/b
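To see what these per-bit energies mean in watts, a minimal sketch follows. The 1 TB/s sustained bandwidth is a hypothetical figure chosen only for illustration; the three pJ/b values are the ones quoted above.

```python
# DRAM interface power for a given energy per bit and sustained bandwidth.
def dram_power_watts(energy_pj_per_bit: float, bandwidth_bytes_per_s: float) -> float:
    bits_per_second = bandwidth_bytes_per_s * 8
    return energy_pj_per_bit * 1e-12 * bits_per_second

ASSUMED_BW = 1e12  # ASSUMPTION: 1 TB/s, not a number from the slides

dram_power = {
    label: dram_power_watts(pj, ASSUMED_BW)
    for label, pj in [("today", 50), ("demonstrated", 8), ("needed (upper bound)", 2)]
}
for label, watts in dram_power.items():
    print(f"{label}: {watts:.0f} W at 1 TB/s")
```

Power scales linearly with bandwidth, so the ratios between the three cases hold for any assumed bandwidth.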

6. System interconnect limited by laser energy and cost



BW tapering and locality awareness necessary

## Straw-man Architecture





Application specific

#### **Control Engine (CE)**



System SW

#### Block (8 XE + CE)



#### **Cluster (16 Blocks)**



#### **Processor Chip (16 Clusters)**



| Parameter           | Value                              |
|---------------------|------------------------------------|
| Technology          | 7 nm, 2018                         |
| Die area            | 500 mm²                            |
| XE/die              | 2048                               |
| Frequency           | 4.2 GHz @ Vdd, 600 MHz @ 50% Vdd   |
| TFLOPs              | 17.2 @ Vdd, 2.5 @ 50% Vdd          |
| Power\*             | 600 W @ Vdd, 37 W @ 50% Vdd        |
| Energy efficiency\* | 34 pJ/F @ Vdd, 15 pJ/F @ 50% Vdd   |
| Memory B/F          | 39 μB/F @ Vdd, 268 μB/F @ 50% Vdd  |

\* Without interconnect (data movement)
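The table entries can be cross-checked against the hierarchy (8 XE per block, 16 blocks per cluster, 16 clusters per chip). The sketch below does that arithmetic. The figure of 2 FLOPs per cycle per XE is an assumption, not stated in the slides; it is chosen because it reproduces the quoted TFLOPs.

```python
# Hierarchy sizes taken from the straw-man slides.
XE_PER_BLOCK = 8
BLOCKS_PER_CLUSTER = 16
CLUSTERS_PER_CHIP = 16

# ASSUMPTION (not in the source): each XE retires 2 FLOPs per cycle.
FLOPS_PER_CYCLE = 2

xe_per_die = XE_PER_BLOCK * BLOCKS_PER_CLUSTER * CLUSTERS_PER_CHIP

def peak_tflops(freq_hz: float) -> float:
    """Peak chip throughput in TFLOP/s at the given clock frequency."""
    return xe_per_die * FLOPS_PER_CYCLE * freq_hz / 1e12

def pj_per_flop(power_w: float, tflops: float) -> float:
    """W / (TFLOP/s) = 1e-12 J per FLOP, i.e. pJ per FLOP."""
    return power_w / tflops

for label, freq_hz, power_w in [("Vdd", 4.2e9, 600.0), ("50% Vdd", 600e6, 37.0)]:
    tf = peak_tflops(freq_hz)
    print(f"{label}: {xe_per_die} XE, {tf:.1f} TFLOPs, {pj_per_flop(power_w, tf):.1f} pJ/F")
```

Under that assumption the computed values land close to the table: the XE count matches exactly, and the throughput and energy-per-FLOP figures agree to within rounding.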

## Wide dynamic range in HW

# Interconnect Structures and Topologies



# **Data Movement Energy**



8 B/Flop at the system level. Naïve BW tapering: geometric, 4X at each level.



Almost constant/hierarchy
System level dominates
Disproportionately large
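A minimal sketch of naïve geometric tapering, for illustration only. The slide text does not make clear which end of the hierarchy the 8 B/Flop figure applies to, so the sketch treats it as a starting value and divides by 4 at each successive level. The number of levels (5) is hypothetical.

```python
# Naive geometric bandwidth tapering: divide by a fixed factor at each level.
def geometric_taper(start_bytes_per_flop: float, factor: float, levels: int) -> list:
    return [start_bytes_per_flop / factor**k for k in range(levels)]

# ASSUMPTION: 5 levels; the slides do not enumerate the levels in text.
taper = geometric_taper(8.0, 4.0, 5)
for level, bpf in enumerate(taper):
    print(f"level {level}: {bpf} B/Flop")
```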

## **Intelligent BW tapering is necessary**

# Intelligent BW Tapering

#### BW tapering inversely proportional to the performance





8 B/Flop at the system level. Tapering increases/hierarchy. This is counterintuitive...!

Decreases with hierarchy. Meets system power goal.

## Data locality—key challenge for software

# Exascale Data Movement Power (with Tapering)



## Simulators Capture Straw-man Architecture

| Simulator                            | Pros                                                                                                                                         | Cons                                                                                                                       |
|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| OCR based<br>behavioral<br>simulator | Near-native application code execution<br>on host processor<br>Rapid application development<br>Epoch statistics as well as total statistics | Does not model real architecture<br>Does not model advanced ISA features<br>Does not reflect expected timing of simulated system |
| Fsim Functional Simulator            | Models HW units, simple timing model<br>Complete statistics and trace file<br>1-10 MIPS per core speed<br>Massively parallel and distributed | Lower speed, highly detailed                                                                                               |

| Tool               | Purpose                                                                                                                                                       | Advantage                                                                                                     | Weakness                                                                                                                                      |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| Power<br>Estimator | Uses statistics/counters to make<br>energy and area estimates for<br>application behavior                                                                     | <ul> <li>Scales from 45nm to 7nm projections</li> <li>Automatic analysis of outputs from FSim runs</li> </ul> | <ul> <li>Only models dynamic<br/>power, uses circuit models<br/>for leakage</li> <li>Calibrated to existing<br/>commercial devices</li> </ul> |
| Memory<br>Analyzer | <ul> <li>Detailed models for cache and/or<br/>scratchpad hierarchies, various<br/>levels &amp; types of coherence</li> <li>Compares configurations</li> </ul> | Enables limited behavioral trace power estimation                                                             | <ul> <li>Does not model Instruction<br/>fetch/execution</li> <li>Limited to behavioral<br/>memory traces at this time</li> </ul>              |

## Tools almost ready: LLVM, Utilities, etc.

1. Extreme parallelism (1000X due to Exa, additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move thread to data if necessary
4. Fine grain resource management (objective function)
5. Applications and algorithms incorporate paradigm change

# Programming & Execution Model

### **Event driven tasks (EDT)**

Dataflow inspired, tiny codelets (self contained)

Non blocking, no preemption

#### **Programming model:**

Separation of concerns: Domain specification & HW mapping

Express data locality with hierarchical tiling

Global, shared, non-coherent address space

Optimization and auto generation of EDTs (HW specific)

#### **Execution model:**

Dynamic, event-driven scheduling, non-blocking

Dynamic decision to move computation to data

Observation-based adaptation (self-awareness)

Implemented in the runtime environment
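The event-driven, non-blocking execution described above can be illustrated with a toy scheduler. This is a sketch of the general dataflow idea only, not the OCR API or any of the runtimes named in this deck. The `EDT` class, `then`, and `run` are hypothetical names.

```python
from collections import deque

class EDT:
    """Hypothetical event-driven task: becomes ready once all input slots are filled."""
    def __init__(self, name, func, num_inputs):
        self.name = name
        self.func = func
        self.inputs = [None] * num_inputs
        self.pending = num_inputs      # number of inputs still missing
        self.consumers = []            # (downstream task, input slot) pairs

    def then(self, consumer, slot):
        """Route this task's output to one input slot of a downstream task."""
        self.consumers.append((consumer, slot))

def run(tasks):
    """Run tasks in dependency order; each task runs to completion, never blocks."""
    ready = deque(t for t in tasks if t.pending == 0)
    order, results = [], {}
    while ready:
        task = ready.popleft()
        value = task.func(*task.inputs)
        results[task.name] = value
        order.append(task.name)
        # Completing a task is the "event" that may make consumers ready.
        for consumer, slot in task.consumers:
            consumer.inputs[slot] = value
            consumer.pending -= 1
            if consumer.pending == 0:
                ready.append(consumer)
    return order, results

a = EDT("a", lambda: 2, 0)
b = EDT("b", lambda: 3, 0)
c = EDT("c", lambda x, y: x * y, 2)
d = EDT("d", lambda x: x + 1, 1)
a.then(c, 0)
b.then(c, 1)
c.then(d, 0)

# Submission order does not matter; readiness is driven by data availability.
order, results = run([d, c, b, a])
print(order, results)
```

A real runtime would also decide where each task runs (for example, moving computation to data); this sketch covers only the readiness rule.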

#### Separation of concerns:

<u>User application, control, and resource management</u>

## Traleika Glacier SW Stack

Concurrent Collections

Hierarchical Tiled Arrays

Habanero-C

**Express** 

**Programming System** 

Parallel Intermediate Language

R-Stream Optimizations

**Optimize** 

**HW Mapping and Tuning** 

Map to HW

## **Execution model with introspection**

Intel Research Runtime

SWARM (ETI)

Open Community Runtime (OCR)

Habanero Runtime (Rice)

DAR<sup>3</sup>TS (Delaware)

Behavioral Simulator

Functional Simulator (Fsim)

Model

Evaluate

# Over-provisioning, Introspection, Self-awareness

#### **Addressing variations**



1. Provide more compute HW
2. Law of large numbers
3. Static profile
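The law-of-large-numbers point can be illustrated with a small Monte Carlo sketch: when many cores with independent speed variation are aggregated, the relative spread of the total shrinks roughly as 1/sqrt(N). The Gaussian model, the 20% per-core sigma, and the trial count are assumptions made for illustration, not data from the slides.

```python
import random
import statistics

def aggregate_spread(num_cores: int, sigma: float, trials: int, seed: int = 0) -> float:
    """Relative std. dev. of mean per-core speed across simulated chips."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # ASSUMPTION: per-core speed ~ Normal(1.0, sigma), independent, clipped at 0.
        total = sum(max(0.0, rng.gauss(1.0, sigma)) for _ in range(num_cores))
        means.append(total / num_cores)
    return statistics.pstdev(means) / statistics.mean(means)

spread_block = aggregate_spread(num_cores=8, sigma=0.2, trials=200)     # one block of XEs
spread_chip = aggregate_spread(num_cores=2048, sigma=0.2, trials=200)   # a full chip
print(f"8 cores: {spread_block:.3f}   2048 cores: {spread_chip:.4f}")
```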

#### Fine grain resource mgmt



Dynamic reconfiguration:

1. Energy efficiency
2. Latency
3. Dynamic resource management

#### **Sensors for introspection**

Processor Chip (16 Clusters)



1. Energy consumption
2. Instantaneous power
3. Computations
4. Data movement

System SW implements introspective execution model

1. Schedule threads based on objectives and resources
2. Dynamically control and manage resources
3. Identify sensors, functions in HW for implementation
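Scheduling against an objective under a resource limit might look like the sketch below. It picks a chip operating point from the two in the straw-man table (full Vdd and 50% Vdd) given a power budget and an objective. The `OperatingPoint` type, `choose_point`, and the objective names are hypothetical and do not describe an interface of the actual system software.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class OperatingPoint:
    name: str
    tflops: float
    power_w: float

    @property
    def pj_per_flop(self) -> float:
        # W / (TFLOP/s) = pJ per FLOP
        return self.power_w / self.tflops

# Values from the straw-man processor table (power excludes interconnect).
POINTS = [
    OperatingPoint("Vdd", 17.2, 600.0),
    OperatingPoint("50% Vdd", 2.5, 37.0),
]

def choose_point(power_budget_w: float, objective: str) -> Optional[OperatingPoint]:
    """Pick the best feasible operating point, or None if none fits the budget."""
    feasible = [p for p in POINTS if p.power_w <= power_budget_w]
    if not feasible:
        return None
    if objective == "throughput":
        return max(feasible, key=lambda p: p.tflops)
    if objective == "energy":
        return min(feasible, key=lambda p: p.pj_per_flop)
    raise ValueError(f"unknown objective: {objective}")

print(choose_point(700.0, "throughput"))
print(choose_point(700.0, "energy"))
print(choose_point(100.0, "throughput"))
```

In an introspective system the budget and the measured power would come from the sensors listed above rather than from constants.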

## Over-provisioned Introspectively Resource Managed System





## X-Stack Components Put Together





## Summary

- Straw-man architecture comprehends technology challenges
- Simulators capture the architecture, ready for evaluation
- Tools and infrastructure are getting ready
- Software stack is making good progress
- Getting ready for thorough evaluation