# Learning to Live with Errors

# Architectural Solutions for Memory Reliability at Extreme Scaling

# Prashant J. Nair



# **MOORE'S LAW IN COMPUTING**

Technology Scaling  $\rightarrow$  2x transistors every ~2 years





#### Improving the Performance of Computing Systems

# **MOORE'S LAW IN MEMORY SYSTEMS**

Technology Scaling: A key driver for applications





DRAM Chip

#### Moore's Law is vital for High-Density Memories













#### Client, Server and IoT devices $\rightarrow$ Scalable Memories

### **CHALLENGES IN MEMORY SCALING**



Source: Lim et al., ISCA 2009.

Per-core DRAM capacity reduces by 30% every 2 years

Moore's Law in Memory Systems  $\rightarrow$  Scaling Wall









#### High aspect ratio $\rightarrow$ Faulty cells (Scaling Faults)

### **RUNTIME FAULTS**

Faults that happen while the machine is operating



DRAM Chip

















# **GRANULARITY OF FAULTS**





#### Two types of Runtime Faults @ Many Granularities

### **RUNTIME FAULTS ARE PERVASIVE**



#### Efficient solutions to mitigate runtime failures

#### GOAL

- Hurdles for Moore's Law: Scaling & Runtime Faults
- Conventional techniques: Costly/Ineffective

Ultra low-cost solutions to sustain Moore's Law in memories using architecture-level approaches



Transient Faults





Low-cost architectural techniques to enable reliable and scalable memory systems  $\rightarrow$  Sustain Moore's Law



**Transient Faults**
# **THIS TALK**



ArchShield: Architectural Framework for Assisting DRAM Scaling By Tolerating High Error-Rates

#### ISCA-2013

**Prashant Nair** Daehyun Kim Moinuddin Qureshi





DRAM

# **PROBLEM: RELIABLE TECHNOLOGY SCALING**



#### Activate a Row of 8KB









#### Memory is accessed in 64B Cachelines $\rightarrow$ 8 Words



Spare Rows and Columns



Enable Spare Rows and Columns

#### Entire row or column sacrificed for a few faulty cells



Error-Rate: 100ppm



#### Row/Column sparing has large area overheads
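To make the sparing overhead concrete, here is a rough back-of-the-envelope sketch. It assumes faulty cells are independent and uniformly distributed (an assumption, not something the slides state) and uses the 8KB row size and the 100 ppm error rate quoted above.

```python
# Rough estimate: how many 8KB rows contain at least one faulty cell at 100 ppm?
# Assumption (not from the slides): cell faults are independent and uniform.

BER = 100e-6                # 100 ppm bit error rate, as quoted on the slide
ROW_BITS = 8 * 1024 * 8     # one 8KB row, in bits


def p_unit_faulty(bits: int, ber: float) -> float:
    """Probability that a unit of `bits` cells has at least one faulty cell."""
    return 1.0 - (1.0 - ber) ** bits


expected_faults_per_row = ROW_BITS * BER
p_row = p_unit_faulty(ROW_BITS, BER)

print(f"expected faulty cells per row: {expected_faults_per_row:.2f}")
print(f"P(row has >= 1 faulty cell):   {p_row:.4f}")
```

Under these assumptions nearly every row would need a spare, which is why sparing at row or column granularity does not scale to this error rate.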

#### Single Error Correct Double Error Detect (SECDED)

Corrects a single-bit fault in each word (8 bytes)

#### For 8GB DIMM $\rightarrow$ 1 Billion words

Examining all the words...

### High chance of two faults in at least one word (Birthday Paradox)

8GB DIMM  $\rightarrow$  1 billion words = N

Expected faults until the first double-fault $\approx 1.25\sqrt{N} \approx 40\text{K}$ faults $\rightarrow$ 0.5 ppm

### SECDED alone cannot protect against scaling faults
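The birthday-paradox estimate can be reproduced with a few lines of arithmetic. This sketch uses the standard approximation $\sqrt{\pi N / 2} \approx 1.25\sqrt{N}$ for the expected number of random draws before two land in the same bin. It assumes faults fall on words uniformly at random, and the variable names are illustrative.

```python
import math

# Birthday-paradox estimate for an 8GB DIMM protected by SECDED per 8B word.
# Assumption: each fault lands in a uniformly random word.

DIMM_BYTES = 8 * 2**30
WORD_BYTES = 8

n_words = DIMM_BYTES // WORD_BYTES          # about 1 billion words (2**30)
total_bits = DIMM_BYTES * 8

# Expected number of faults before some word receives its second fault.
expected_faults = math.sqrt(math.pi * n_words / 2)

# Express that fault count as a bit error rate in parts per million.
ber_ppm = expected_faults / total_bits * 1e6

print(f"words:            {n_words}")
print(f"faults to clash:  {expected_faults:.0f}")
print(f"tolerable BER:    {ber_ppm:.2f} ppm")
```

The result is a few tens of thousands of faults, or roughly half a ppm, which is far below the 100 ppm target discussed earlier.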

# **SCALING OPTION 3: STRONG ECC**

#### Strong ECC are robust, but complex and costly

- Memory requests incur encoding/decoding latency
- Bit Error Rate (BER) of 100 ppm: ECC-4 → 50% storage overhead

Strong ECC are inefficient for tolerating errors

### **DISSECTING FAULT PROBABILITIES**



#### No Fault / 1-Bit Fault / Multi-Bit Faults

#### Most faults are 1-bit: Exploit skew in probability
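The skew can be illustrated with a binomial model. The sketch below assumes independent bit faults at the 100 ppm rate used earlier and a 64-bit data word (ECC bits are ignored for simplicity). Both are modelling assumptions rather than figures from the slides.

```python
from math import comb

# Binomial model of faults in one 64-bit word at a 100 ppm bit error rate.
# Assumptions: independent bit faults; only the 64 data bits are counted.

WORD_BITS = 64
BER = 100e-6


def p_k_faults(k: int, bits: int = WORD_BITS, ber: float = BER) -> float:
    """Probability that exactly k of `bits` cells are faulty."""
    return comb(bits, k) * ber**k * (1.0 - ber) ** (bits - k)


p_none = p_k_faults(0)
p_single = p_k_faults(1)
p_multi = 1.0 - p_none - p_single

print(f"no fault:        {p_none:.6f}")
print(f"1-bit fault:     {p_single:.6f}")
print(f"multi-bit fault: {p_multi:.2e}")
```

In this model, single-bit faults outnumber multi-bit faults by more than two orders of magnitude, so only a tiny fraction of words needs anything stronger than SECDED.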

# **ARCHSHIELD: AN OVERVIEW**

#### Inspired from SSDs to handle high rates of errors



Use Replication Area and Fault-Map to handle faults

# **FAULTS: RUNTIME TESTING & CLASSIFICATION**

During the first bootup, runtime testing is performed. Each 8B word gets classified into one of three types: no fault, single-bit fault, or multi-bit fault.

**Faulty cell information**: Stored on the hard drive for future use

Identifies the faulty cells & decides type of correction

### **EXPOSE FAULTS USING A FAULT-MAP**

#### A map to keep track of all words



Fault-Map identifies faulty vs non-faulty words

#### 4 bits (2 bits replicated) per 8B word

00 → No Fault, 01 → Single-Bit Fault, 11 → Multi-Bit Fault

#### Per-word Fault-Map has an overhead of 6.4%

#### 4 bits (2 bits replicated) per 64B cacheline

Per-line Fault-Map has an overhead of 0.8%
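A small sketch of the entry encoding and the overhead arithmetic follows. The 2-bit codes come from the slide. The function names and the policy of rejecting an entry whose two copies disagree are hypothetical, added only for illustration.

```python
# Fault-Map entry: a 2-bit code stored twice (4 bits total).
# Codes are from the slide; the mismatch policy below is a hypothetical choice.

FAULT_CODES = {"none": 0b00, "single": 0b01, "multi": 0b11}
CODE_NAMES = {code: name for name, code in FAULT_CODES.items()}


def encode_entry(kind: str) -> int:
    """Replicate the 2-bit code into a 4-bit entry."""
    code = FAULT_CODES[kind]
    return (code << 2) | code


def decode_entry(entry: int) -> str:
    """Return the fault class, rejecting entries whose copies disagree."""
    hi, lo = (entry >> 2) & 0b11, entry & 0b11
    if hi != lo or lo not in CODE_NAMES:
        raise ValueError("corrupted fault-map entry")
    return CODE_NAMES[lo]


def fault_map_overhead(entry_bits: int, covered_bytes: int) -> float:
    """Storage overhead of one entry relative to the data it covers."""
    return entry_bits / (covered_bytes * 8)


per_word = fault_map_overhead(4, 8)    # one entry per 8B word
per_line = fault_map_overhead(4, 64)   # one entry per 64B cacheline

print(f"per-word overhead: {per_word:.2%}")
print(f"per-line overhead: {per_line:.2%}")
```

The raw ratios are 4/64 and 4/512, which is in line with the rounded 6.4% and 0.8% quoted on the slides. Tracking faults per cacheline cuts the map size by 8x.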

### **FAULT-MAP: OPERATION**










#### **Caching Fault-Map for Performance**





Cache Fault-Map line  $\rightarrow$  Information of 128 lines
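The figure of 128 lines follows from packing 4-bit entries into a 64B Fault-Map line. The sketch below shows one way a controller might locate an entry. The linear layout and the `fm_base` parameter are assumptions for illustration, not the layout used in the actual design.

```python
# Locating the Fault-Map entry for a physical address.
# Assumption: entries are packed linearly, starting at a hypothetical fm_base.

LINE_BYTES = 64
ENTRY_BITS = 4
ENTRIES_PER_FM_LINE = LINE_BYTES * 8 // ENTRY_BITS   # entries in one 64B line


def fault_map_location(phys_addr: int, fm_base: int = 0) -> tuple[int, int]:
    """Return (address of the Fault-Map line, bit offset of the entry in it)."""
    line_idx = phys_addr // LINE_BYTES
    fm_line = line_idx // ENTRIES_PER_FM_LINE
    entry = line_idx % ENTRIES_PER_FM_LINE
    return fm_base + fm_line * LINE_BYTES, entry * ENTRY_BITS


# One cached Fault-Map line covers this many bytes of data.
covered_bytes = ENTRIES_PER_FM_LINE * LINE_BYTES

print(ENTRIES_PER_FM_LINE, covered_bytes)
print(fault_map_location(64 * 128))
```

With this layout a single cached Fault-Map line answers lookups for 8KB of consecutive data, so accesses with spatial locality mostly hit in the cache.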

# **KEEP REPLICAS OF FAULTY WORDS**

Valid data from faulty words are stored in replicas





Replication area stores valid data of faulty words









Replicas  $\rightarrow$  Anywhere in a contiguous replication area



#### Looking up the Replication Area $\rightarrow$ High Latency

#### Taking inspiration from Hash-Tables



#### Hash-Table Lookup $\rightarrow$ Low Latency

#### A Set Associative Area (Like a Hash Table)



Set-Associative Structure  $\rightarrow$  Low Latency

Set Associative: May not handle all faulty words



#### Set-Associative Structure $\rightarrow$ Can Overflow

Taking inspiration from Hash-Tables with Chaining



#### Hash-Table with Chaining $\rightarrow$ Mitigates Overflows

#### HANDLE SKEWS IN THE REPLICATION AREA

#### **Provision Overflow Sets**



### Replication Area

#### Overflow Sets handle skews in the Replication Area
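The set-associative replication area with overflow sets can be sketched as a small hash table with chaining. This is a simplified model, not the ArchShield implementation. The modulo index function, the single level of chaining, and the class and method names are all assumptions.

```python
# Toy model of a set-associative replication area with overflow sets.
# Assumptions: modulo indexing, one overflow set per full primary set.


class ReplicationArea:
    def __init__(self, num_sets: int, ways: int, num_overflow_sets: int):
        self.ways = ways
        self.sets = [dict() for _ in range(num_sets)]
        self.overflow = [dict() for _ in range(num_overflow_sets)]
        self.overflow_ptr = [None] * num_sets   # primary set -> overflow set
        self.next_free_overflow = 0

    def _index(self, word_addr: int) -> int:
        return word_addr % len(self.sets)

    def insert(self, word_addr: int, data) -> None:
        idx = self._index(word_addr)
        primary = self.sets[idx]
        if word_addr in primary or len(primary) < self.ways:
            primary[word_addr] = data
            return
        # Primary set is full: chain to an overflow set, allocating if needed.
        if self.overflow_ptr[idx] is None:
            if self.next_free_overflow == len(self.overflow):
                raise MemoryError("no overflow sets left")
            self.overflow_ptr[idx] = self.next_free_overflow
            self.next_free_overflow += 1
        spill = self.overflow[self.overflow_ptr[idx]]
        if word_addr not in spill and len(spill) >= self.ways:
            raise MemoryError("overflow set full")
        spill[word_addr] = data

    def lookup(self, word_addr: int):
        idx = self._index(word_addr)
        if word_addr in self.sets[idx]:
            return self.sets[idx][word_addr]
        ptr = self.overflow_ptr[idx]
        return self.overflow[ptr].get(word_addr) if ptr is not None else None


area = ReplicationArea(num_sets=4, ways=2, num_overflow_sets=1)
for addr, value in [(0, "a"), (4, "b"), (8, "c")]:   # all map to set 0
    area.insert(addr, value)
print(area.lookup(8), area.overflow_ptr[0])
```

A lookup touches one primary set and at most one overflow set, so latency stays bounded even when faulty words cluster on a few sets.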

# **ARCHSHIELD: RESULTS**



|                | Replication Area | Fault-Map |
|----------------|------------------|-----------|
| Area Overheads | 3.2%             | 0.8%      |

Tolerates 200x higher BER with only 1% slowdown

**XED:** Exposing On-Die Error Detection Information for Strong Memory Reliability

#### ISCA-2016

**Prashant Nair** Vilas Sridharan Moinuddin Qureshi







# **ON-DIE ECC: MITIGATE SCALING FAULTS**

Vendors plan to use "On-Die ECC"

- Fix scaling faults transparently
- Good DIMM with bad chips (yield)

Part of new DDR standards

#### **On-Die ECC fixes scaling faults invisibly**

### **MITIGATING RUNTIME FAULTS**

### Runtime faults

- Chip faults common
- Need strong ECC

#### ECC-DIMM (9-Chips)

| Fault Mode | Transient Fault Rate (FIT) | Permanent Fault Rate (FIT) |
|------------|----------------------------|----------------------------|
| Bit        | 14.2                       | 18.6                       |
| Word       | 1.4                        | 0.3                        |
| Column     | 1.4                        | 5.6                        |
| Row        | 0.2                        | 8.2                        |
| Bank       | 0.8                        | 10                         |
| *Total     | 18                         | 42.7                       |
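To get a feel for these rates, recall that 1 FIT is one failure per billion device-hours. The sketch below converts the table totals into a failure probability over a 7-year lifetime. It assumes the rates are per DRAM chip and constant over time (an exponential failure model). Both are assumptions, since the slide does not say.

```python
import math

# Convert FIT rates (failures per 1e9 device-hours) into a lifetime probability.
# Assumptions: rates are per DRAM chip and constant over the chip's lifetime.

TRANSIENT_FIT = {"bit": 14.2, "word": 1.4, "column": 1.4, "row": 0.2, "bank": 0.8}
PERMANENT_FIT = {"bit": 18.6, "word": 0.3, "column": 5.6, "row": 8.2, "bank": 10.0}

HOURS_7_YEARS = 7 * 365 * 24


def p_fail(fit: float, hours: float) -> float:
    """Probability of at least one failure within `hours` at a constant rate."""
    return 1.0 - math.exp(-fit * 1e-9 * hours)


total_transient = sum(TRANSIENT_FIT.values())
total_permanent = sum(PERMANENT_FIT.values())
p_chip_7y = p_fail(total_transient + total_permanent, HOURS_7_YEARS)

print(f"total FIT: {total_transient + total_permanent:.1f}")
print(f"P(chip sees a runtime fault in 7 years): {p_chip_7y:.4f}")
```

A per-chip probability of a fraction of a percent looks small, but it compounds across the many chips in a server and across large fleets.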



## **MITIGATING RUNTIME FAULTS**

### *Runtime chip faults* → Chipkill (strong ECC)

#### **18 DRAM Chips**



#### Cost: 18 Chips, Performance and Power Inefficient

#### GOAL

- Get "Chipkill-level" reliability with only 9 Chips
- Use On-Die ECC to enable Low-Cost Chipkill

## **XED: RE-PROVISION ON-DIE**



#### On-Die ECC can detect chip-failures

Expose On-Die Error Detection  $\rightarrow$  Chipkill with 9 Chips




On-Die ECC detected it


#### **OPTION 1: Use additional wires**




#### **OPTION 2: Use additional burst/transaction**




#### Expose On-Die error detection with minor changes

# **XED: ON-DIE ERROR INFORMATION FOR FREE**

On detecting an error, the DRAM chip sends a 64-bit "Catch-Word" (CW) instead of data

- Chips provisioned with a unique Catch-Word
- No additional wires/bandwidth overheads
- Compatible with existing memory protocols

64-bit Catch-Words identify the faulty chip

#### Catch Word (CW) $\neq$ Valid Data (D2)

Then $\rightarrow$ PA $\neq$ D0 $\oplus$ D1 $\oplus$ CW $\oplus \dots \oplus$ D7

#### Catch Word (CW) = Valid Data (D2) [*Collision*]

Then $\rightarrow$ PA = D0 $\oplus$ D1 $\oplus$ CW $\oplus \dots \oplus$ D7

Catch-Word collision: Doesn't affect correctness
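The two cases above can be sketched as a memory-controller read path. This is a simplified model under stated assumptions: eight data chips plus one parity chip holding PA, at most one data chip reporting an error, and a parity chip that is itself fault-free. The function names and catch-word values are hypothetical.

```python
from functools import reduce
from operator import xor

# Simplified XED read path: 8 data chips, 1 parity chip holding PA.
# Assumptions: at most one data chip is faulty; the parity chip is fault-free.


def parity(words):
    """XOR of all 64-bit words."""
    return reduce(xor, words, 0)


def xed_read(chip_outputs, parity_word, catch_words):
    """Return corrected data given per-chip outputs and per-chip catch-words."""
    if parity(chip_outputs) == parity_word:
        # Parity agrees: data is valid, even if a word equals a catch-word.
        return list(chip_outputs)
    flagged = [i for i, w in enumerate(chip_outputs) if w == catch_words[i]]
    if len(flagged) != 1:
        raise RuntimeError("uncorrectable: cannot locate a single faulty chip")
    bad = flagged[0]
    others = [w for i, w in enumerate(chip_outputs) if i != bad]
    fixed = list(chip_outputs)
    fixed[bad] = parity(others) ^ parity_word   # rebuild the erased word
    return fixed


catch_words = [0xC0DE_0000_0000_0000 | i for i in range(8)]   # hypothetical
data = [0x1000 + i for i in range(8)]
pa = parity(data)

# Chip 2 detects an error on-die and sends its catch-word instead of data.
outputs = list(data)
outputs[2] = catch_words[2]
print(xed_read(outputs, pa, catch_words) == data)
```

In the collision case the stored data really is the catch-word, parity matches, and the controller returns the data unchanged. This is why a collision does not affect correctness.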

- A chip stores 64 bits/cache-line  $\rightarrow$  2<sup>64</sup> combinations
- Even a 16Gb chip has only 2<sup>28</sup> cachelines
- Nearly 2<sup>63.99</sup> data combinations free!

The catch-word will most likely not collide
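The counting argument can be checked directly. The sketch assumes the worst case, in which every cacheline in the chip holds a distinct 64-bit value, and a catch-word chosen uniformly at random (an illustrative assumption).

```python
import math

# How much of the 64-bit value space can a 16Gb chip occupy at most?
# Assumption: worst case, every stored 64-bit value in the chip is distinct.

VALUE_BITS = 64
CHIP_BITS = 16 * 2**30                       # 16Gb chip

lines_per_chip = CHIP_BITS // VALUE_BITS     # 64 bits per cacheline per chip
free_patterns = 2**VALUE_BITS - lines_per_chip

# Upper bound on the chance that a random catch-word matches a stored value.
p_collision_bound = lines_per_chip / 2**VALUE_BITS

print(f"cachelines per chip: 2^{int(math.log2(lines_per_chip))}")
print(f"free patterns:       2^{math.log2(free_patterns):.2f}")
print(f"collision bound:     2^{int(math.log2(p_collision_bound))}")
```

Even in this worst case, the stored values occupy a negligible fraction of the 64-bit space.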

USIMM: 8 Cores, 4 Channels, 2 Ranks, 8 Banks

FaultSim\*: Memory Reliability Simulator

- Real World Fault Data
- 7-year system lifetime
- Billion Monte-Carlo Trials
- Metric: Probability of System Failure
- Scaling Fault-Rate: 10<sup>-4</sup>
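To illustrate the Monte-Carlo methodology, here is a toy estimate that is far simpler than FaultSim and not derived from it. It estimates the probability that two or more chips in a 9-chip rank see a runtime fault within 7 years. It assumes the total FIT rates from the runtime-fault table (18 transient plus 42.7 permanent) apply per chip, that failures are independent, and that any fault counts as a chip failure.

```python
import math
import random

# Toy Monte-Carlo reliability estimate (not FaultSim).
# Assumptions: per-chip FIT of 18 + 42.7, independent chips, constant rates.

FIT_PER_CHIP = 18.0 + 42.7
HOURS = 7 * 365 * 24
CHIPS = 9
P_CHIP = 1.0 - math.exp(-FIT_PER_CHIP * 1e-9 * HOURS)


def trial(rng: random.Random) -> bool:
    """One simulated lifetime: True if two or more chips fail."""
    failed = sum(rng.random() < P_CHIP for _ in range(CHIPS))
    return failed >= 2


def estimate(trials: int, seed: int = 1) -> float:
    rng = random.Random(seed)
    return sum(trial(rng) for _ in range(trials)) / trials


# Closed-form binomial answer, used as a sanity check on the simulation.
analytic = 1.0 - (1.0 - P_CHIP) ** CHIPS - CHIPS * P_CHIP * (1.0 - P_CHIP) ** (CHIPS - 1)
simulated = estimate(200_000)

print(f"analytic:  {analytic:.2e}")
print(f"simulated: {simulated:.2e}")
```

Rare events need many trials before the estimate settles, which is one reason a reliability study of this kind runs on the order of a billion trials.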

#### **RESULTS: RELIABILITY**

#### **XED vs Commercial ECC schemes**



#### XED provides 172x higher reliability

#### **RESULTS: PERFORMANCE AND EDP**



#### XED enables Chipkill with a single DIMM

Significant performance & power benefits

#### **SUMMARY**

- Hurdles for Moore's Law: Scaling & Runtime Faults
- Current techniques are costly/ineffective

Low-cost architectural techniques can enable reliable and scalable memory systems  $\rightarrow$  Sustain Moore's Law

- **100-1000x** higher reliability with minimal overheads















#### **FUTURE RESEARCH VECTORS**

## **RESEARCH VECTOR: HYBRID MEMORY SYSTEM**

#### Reliability and Performance Optimizations



## **RESEARCH VECTOR: RELIABILITY + SECURITY**

Low-cost reliability for memory systems that implement security



## **RESEARCH VECTOR: OPTIMIZED IOT**

#### IoT optimizations by using codes to save power and provide security

















#### Unlock more benefits with cross-stack solutions

#### Enabling Practical and Scalable Quantum Computers

Quantum Substrate has a high error-rate

1000x higher bandwidth overhead due to error correction

#### Ways to delegate ECC near the quantum substrate

#### **COLLABORATORS**



# **THANK YOU**

Always welcome to reach out to me at pnair6@gatech.edu













# BACKUP

Reducing Read Latency of Phase Change Memory via Early Read and Turbo Read

#### HPCA-2015

**Prashant Nair** Chia-Chen Chou Bipin Rajendran Moinuddin Qureshi







"Low-Latency Non-Volatile memories using Error Codes"

- Low (SET) and High (RESET) resistance states
- Cell states are compared to reference resistance
- The states correspond to binary values of 0 and 1

#### The read latency of PCM depends on the value of $R_{\rm ref}$











Detect with Berger Code, Retry on error

Read latency 30% $\downarrow$ $\rightarrow$ Performance 26% $\uparrow$
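A Berger code stores the count of zero bits alongside the data, which lets it detect any error where all flipped bits change in the same direction. The sketch below shows the check and a detect-then-retry read. The `fast_read` and `slow_read` callables are hypothetical stand-ins for a low-latency read and a conventional read. The model assumes the faster read only produces errors in one direction.

```python
# Berger code: check symbol = number of 0 bits in the data word.
# Assumption: the fast read only produces unidirectional errors.


def berger_check(data: int, n_bits: int) -> int:
    """Number of zero bits among the low n_bits of data."""
    ones = bin(data & ((1 << n_bits) - 1)).count("1")
    return n_bits - ones


def is_valid(data: int, check: int, n_bits: int) -> bool:
    return berger_check(data, n_bits) == check


def read_with_retry(fast_read, slow_read, n_bits: int) -> int:
    """Try the fast read first; fall back to the slow read on a check failure."""
    data, check = fast_read()
    if is_valid(data, check, n_bits):
        return data
    data, _ = slow_read()
    return data


N = 8
stored = 0b10110010
check = berger_check(stored, N)

# Unidirectional error: two 1 bits are misread as 0, the check is unchanged.
misread = stored & ~0b10010000 & 0xFF

value = read_with_retry(lambda: (misread, check), lambda: (stored, check), N)
print(is_valid(misread, check, N), value == stored)
```

Since 1-to-0 flips can only raise the zero count in the data, the stored check no longer matches, and the read is retried at the slower, safe latency.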

# **BIRTHDAY PROBLEM**

2-Chip Failures

Chipkill (18-chips)

### 2-Chip Failures $\rightarrow$ Extend to Multi-Chip Failures

#### SDC AND DUE RATE OF XED

| Source of Vulnerability            | Rate over 7 years               |
|------------------------------------|---------------------------------|
| XED: Scaling-Related Faults        | No SDC or DUE                   |
| XED: Row/Column/Bank Failure       | $1.4 \times 10^{-13}$ (SDC)     |
| XED: Word Failure                  | $6.1 \times 10^{-6}$ (**DUE**)  |
| Data Loss from Multi-Chip Failures | $5.8 \times 10^{-4}$            |

# **ADDITIONAL BURST/TRANSACTION**



# **XED VS LOT-ECC**

Figure: XED vs LOT-ECC across the SPEC, PARSEC, BIOBENCH, and COMM suites, with GMEAN.