Our work is frequently published in top-tier venues such as ISCA, MICRO, HPCA, ASPLOS, DSN, SC, NeurIPS, and VLDB (typical acceptance rates below 20%). Our projects have also been featured in top journals and letters such as EIR, ACM TACO, and CAL.
A unified address space is vital for heterogeneous systems as it enables efficient data sharing between CPUs and GPUs. However, GPU address translation faces challenges due to high TLB pressure, particularly with irregular and memory-intensive applications. We observe that, compared to an ideal scenario, address translation overheads cause a slowdown of up to 34.5% in modern heterogeneous systems.
This paper introduces Avatar, a novel framework to accelerate address translation in GPUs. Avatar comprises two key components: the Contiguity-Aware Speculative Translation (CAST) and In-Cache Validation (CAVA) mechanisms. Avatar identifies the potential for predicting virtual-to-physical address mappings by monitoring pages that are contiguous in both the virtual and physical address spaces. Leveraging this insight, CAST speculatively translates virtual addresses into physical addresses. This enables immediate data fetching into GPUs while address translation occurs in the background, reducing TLB-miss overhead.
Unfortunately, modern GPUs lack support for speculative execution, which limits CAST’s performance gain. Data fetched from speculated physical addresses is unusable until validation. CAVA addresses this limitation by quickly validating speculated physical addresses. To this end, CAVA embeds page mapping information into each 32B sector of 128B cache lines. Thus, CAVA enables fetching a sector block from memory for a speculated address and rapidly validating the speculative translation using the embedded mapping information. Our experiments show that Avatar achieves a high speculation accuracy of 90.3% and improves GPU performance by 37.2% (on average).
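To make the CAST idea concrete, here is a minimal Python sketch of contiguity-based translation speculation. It illustrates the concept only, not Avatar's hardware design; the `TLBEntry` fields and `speculate_translation` helper are hypothetical.

```python
# Illustrative sketch (not the paper's implementation): speculating a
# virtual-to-physical translation from observed contiguity.

from dataclasses import dataclass

@dataclass
class TLBEntry:
    vpn: int          # virtual page number
    ppn: int          # physical page number
    contiguity: int   # pages known to be contiguous after this one

def speculate_translation(tlb: list[TLBEntry], miss_vpn: int):
    """On a TLB miss, guess the PPN if the missing VPN falls inside a
    region that is contiguous in both address spaces."""
    for entry in tlb:
        delta = miss_vpn - entry.vpn
        if 0 <= delta <= entry.contiguity:
            return entry.ppn + delta  # speculative PPN; must be validated
    return None  # no basis for speculation; fall back to a page walk

# Example: pages vpn=100..103 map to ppn=500..503.
tlb = [TLBEntry(vpn=100, ppn=500, contiguity=3)]
assert speculate_translation(tlb, 102) == 502
assert speculate_translation(tlb, 200) is None
```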
Quantum computers face challenges due to limited resources, particularly in cloud environments. Despite these obstacles, Variational Quantum Algorithms (VQAs) are considered promising applications for present-day Noisy Intermediate-Scale Quantum (NISQ) systems. VQAs require multiple optimization iterations to converge on a globally optimal solution. Moreover, these optimizations, known as restarts, need to be repeated from different points to mitigate the impact of noise. Unfortunately, cloud job scheduling policies for VQA tasks remain largely unoptimized. Notably, each VQA execution instance is typically scheduled on a single NISQ device. Given the variety of devices in the cloud, users often prefer higher-fidelity devices to ensure higher-quality solutions. However, this preference leads to increased queueing delays and unbalanced resource utilization.
We propose Qoncord, an automated job scheduling framework to address these cloud-centric challenges for VQAs. Qoncord leverages the insight that not all training iterations and restarts are equal: it strategically divides the training process into exploratory and fine-tuning phases. Early exploratory iterations, which are more resilient to noise, are executed on less busy machines, while fine-tuning occurs on high-fidelity machines. This adaptive approach mitigates the impact of noise and optimizes resource usage and queueing delays in cloud environments. Qoncord also significantly reduces execution time and minimizes restart overheads by eliminating low-performance iterations. Thus, Qoncord offers similar solutions 17.4× faster. Similarly, it can offer 13.3% better solutions for the same time budget as the baseline.
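A toy sketch of the phase-aware placement idea follows; the device records with `fidelity` and `queue_depth` fields are assumptions for illustration, and Qoncord's actual policy is more sophisticated.

```python
# Minimal sketch: route exploratory iterations to idle machines and
# fine-tuning iterations to high-fidelity machines.

def pick_device(devices, phase):
    """Exploration tolerates noise, so prefer short queues; fine-tuning
    prefers the highest-fidelity device."""
    if phase == "explore":
        return min(devices, key=lambda d: d["queue_depth"])
    return max(devices, key=lambda d: d["fidelity"])

devices = [
    {"name": "dev_a", "fidelity": 0.99, "queue_depth": 40},
    {"name": "dev_b", "fidelity": 0.95, "queue_depth": 3},
]
assert pick_device(devices, "explore")["name"] == "dev_b"
assert pick_device(devices, "finetune")["name"] == "dev_a"
```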
CLUSTER
Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience
Bo Fang, Xinyi Li, Harvey Dam, Cheng Tan, Siva Kumar Sastry Hari, Timothy Tsai, Ignacio Laguna, Dingwen Tao, Ganesh Gopalakrishnan, Prashant J. Nair, Kevin Barker, and Ang Li
In the Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2024
Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). Thus, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness with significant performance, area, and memory footprint improvements. While promising, the effect of mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into the mixed-precision computation results. We investigate how these faults affect the accuracy of machine learning applications. Based on the characteristics of error resilience, we offer lightweight error detection and correction solutions that significantly improve overall model accuracy by 75% when the models experience hardware faults. The solutions can be efficiently integrated into the accelerator’s pipelines.
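As a flavor of what such fault injection involves, the sketch below flips a single bit in an IEEE-754 half-precision value, the way an injector might perturb one element of a mixed-precision GEMM result. It uses only the standard library; `flip_fp16_bit` is a hypothetical helper, not part of the MPGemmFI tool.

```python
# Illustrative fault-injection sketch: flip one bit of an FP16 value.

import struct

def flip_fp16_bit(value: float, bit: int) -> float:
    """Flip bit (0..15) of the IEEE-754 half-precision encoding of value."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    raw ^= 1 << bit
    (out,) = struct.unpack("<e", struct.pack("<H", raw))
    return out

print(flip_fp16_bit(1.0, 13))  # exponent-bit flip: 1.0 -> 0.00390625
print(flip_fp16_bit(1.0, 0))   # mantissa LSB flip: 1.0 -> 1.0009765625
```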
Recommendation models utilize deep learning networks and massive embedding tables, making them compute- and memory-intensive. Typically, these models are trained in hybrid CPU-GPU or GPU-only mode. The hybrid mode uses the GPU to accelerate the neural network while employing the CPUs to store and supply the memory-intensive embedding tables. However, this approach can result in significant CPU-to-GPU transfer time. In contrast, the GPU-only mode stores embedding tables in the High Bandwidth Memory (HBM) across multiple GPUs. However, this approach requires GPU-to-GPU backend communication, which increases communication overheads as the number of GPUs and the size of embedding entries increase.
This paper introduces a heterogeneous acceleration pipeline, called Hotline, to address these concerns. The Hotline accelerator uses the insight that only a few embedding entries are frequently accessed (popular) and develops a data-aware and model-aware scheduling pipeline. This pipeline utilizes CPU main memory for non-popular embeddings and the GPUs’ HBM for popular embeddings. The Hotline accelerator fragments a single mini-batch into popular and non-popular micro-batches (μ-batches). The system then gathers the required working parameters for the non-popular μ-batches from the CPU while allowing the GPUs to execute popular μ-batches. During training, the hardware accelerator dynamically stitches together the execution of popular embeddings on GPUs and non-popular embeddings from the CPU’s main memory. Our studies on real-world datasets and models show that Hotline reduces average training time by 2.2x compared to an Intel-optimized CPU-GPU DLRM baseline.
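The popularity-based fragmentation step can be pictured with a short sketch. This is a simplification for illustration only; the `hot_ids` set and the all-hot criterion are assumptions, and the real accelerator performs this in hardware.

```python
# Sketch of popularity-based input fragmentation in the spirit of Hotline.

def split_minibatch(batch, hot_ids):
    """Split a mini-batch into a popular u-batch (all embedding IDs hot,
    servable from GPU HBM) and a non-popular u-batch (needs CPU memory)."""
    popular, non_popular = [], []
    for sample in batch:
        bucket = popular if all(i in hot_ids for i in sample["ids"]) else non_popular
        bucket.append(sample)
    return popular, non_popular

hot = {1, 2, 3}
batch = [{"ids": [1, 2]}, {"ids": [2, 9]}, {"ids": [3]}]
pop, cold = split_minibatch(batch, hot)
assert len(pop) == 2 and len(cold) == 1
```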
Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs.
This paper introduces “Keyformer”, an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as “key” tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer’s performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. By reducing the KV cache, Keyformer lowers inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model’s accuracy.
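The sketch below illustrates score-based KV-cache pruning in NumPy: keep only the top-k tokens by score. The uniform random score here is a stand-in; Keyformer's actual score function (and its handling of recent tokens) is defined in the paper.

```python
# Illustrative sketch of score-based KV-cache pruning.

import numpy as np

def prune_kv_cache(keys, values, scores, k):
    """keys/values: [seq_len, dim]; scores: [seq_len]. Returns the k
    highest-scoring entries, preserving their original order."""
    top = np.sort(np.argsort(scores)[-k:])
    return keys[top], values[top]

seq_len, dim = 8, 4
rng = np.random.default_rng(0)
K, V = rng.normal(size=(seq_len, dim)), rng.normal(size=(seq_len, dim))
scores = rng.random(seq_len)        # stand-in for the paper's score function
K2, V2 = prune_kv_cache(K, V, scores, k=3)
assert K2.shape == (3, dim)
```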
The Quantum Approximate Optimization Algorithm (QAOA) provides a quantum solution for combinatorial optimization problems. However, the optimal parameter search process of QAOA is greatly affected by noise, leading to non-optimal solutions. This paper introduces a novel approach to optimize QAOA by exploiting the energy landscape concentration of similar instances via graph reduction, thus addressing the effect of noise. We formalize the notion of similar instances in QAOA and develop a Simulated Annealing-based graph reduction algorithm, called Red-QAOA, to identify the most similar subgraph for efficient parameter optimization. Red-QAOA outperforms state-of-the-art Graph Neural Network (GNN) based graph pooling techniques and demonstrates effectiveness on a diverse set of real-world optimization problems encompassing 3200 graphs. Red-QAOA reduces node and edge counts by 28% and 37%, respectively, while maintaining a low mean squared error of 2%. These reductions enable the identification of a parameter set that is closer to the true optimal solution in the presence of noise. By substantially streamlining the search for QAOA parameters, our approach sets the stage for the practical application of quantum algorithms to complex optimization problems.
2023
PACT
SparseFT: Sparsity-aware Fault Tolerance for Reliable CNN Inference on GPUs
Gwangeun Byeon, Seungtae Lee, Seongwook Kim, Yongjun Kim, Prashant J. Nair, and Seokin Hong
In the Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct 2023
Graphics Processing Units (GPUs), while offering exceptional performance for CNN inference tasks, are susceptible to both transient and permanent hardware faults due to the integration of numerous processing elements and advancements in technology scaling. This paper proposes a novel and cost-effective fault mitigation technique, called Sparsity-aware Fault Tolerance (SparseFT), to ensure reliable CNN inference on GPUs. SparseFT leverages inherent sparsity in the activation maps to detect and correct errors on the processing elements without hardware redundancy. By exploiting the characteristic of dot-products, where multiplications with zero operands are ineffectual, SparseFT dynamically duplicates an effectual computation (i.e., a multiplication with non-zero operands) to the processing element initially assigned to the ineffectual one. It then compares the duplicated computation results to detect errors. Experimental results demonstrate that SparseFT achieves more than 97% error detection coverage with less than 1% performance overhead for state-of-the-art CNN models.
Scalable Solid-State Drives (SSDs) have ushered in a transformative era in data storage and accessibility, spanning both data centers and portable devices. However, the strides made in scaling this technology can bear significant environmental consequences. On a global scale, a notable portion of semiconductor manufacturing relies on electricity derived from coal and natural gas sources. A striking example is the manufacturing process for a single gigabyte of Flash memory, which emits approximately 0.16 kg of CO2, a considerable fraction of the total carbon emissions attributed to the system. Remarkably, the manufacturing of storage devices alone contributed an estimated 20 million metric tonnes of CO2 emissions in 2021. In light of these environmental concerns, this paper analyzes the sustainability trade-offs of SSDs compared to traditional Hard Disk Drives (HDDs). Moreover, this study proposes methodologies to effectively gauge the embodied carbon costs associated with storage systems. The research encompasses four key strategies to enhance the sustainability of storage systems. First, the paper offers guidance for selecting the most suitable storage medium, be it SSDs or HDDs, considering the broader ecological impact. Second, the paper advocates techniques that extend the lifespan of SSDs, thereby mitigating premature replacements and their attendant environmental toll. Third, the paper emphasizes the need for efficient recycling and reuse of high-density multi-level cell-based SSDs, underscoring the significance of minimizing electronic waste. Lastly, for handheld devices, the paper underscores the potential of harnessing the elasticity offered by cloud storage solutions as a means to curtail the ecological repercussions of localized data storage. In summation, this study critically addresses the embodied carbon issues associated with SSDs, compares them with HDDs, and proposes a comprehensive framework of strategies to enhance the sustainability of storage systems.
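A quick worked example of the per-gigabyte figure quoted above; the drive capacities are arbitrary illustrations.

```python
# Embodied manufacturing CO2 for a flash-based drive at the abstract's
# cited rate of ~0.16 kg CO2 per GB.

KG_CO2_PER_GB_FLASH = 0.16  # figure cited in the abstract

def embodied_co2_kg(capacity_gb: float) -> float:
    return capacity_gb * KG_CO2_PER_GB_FLASH

print(embodied_co2_kg(512))   # 81.92 kg for a 512 GB SSD
print(embodied_co2_kg(2048))  # 327.68 kg for a 2 TB SSD
```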
Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also generates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, “straggler” devices with lower performance often dictate the overall training time in FL. In this work, we aim to alleviate this performance bottleneck due to stragglers by dynamically balancing the training load across the system. We introduce Invariant Dropout, a method that extracts a sub-model based on the weight update threshold, thereby minimizing potential impacts on accuracy. Building on this dropout technique, we develop an adaptive training framework, Federated Learning using Invariant Dropout (FLuID). FLuID offers a lightweight sub-model extraction to regulate computational intensity, thereby reducing the load on straggler devices without affecting model quality. Our method leverages neuron updates from non-straggler devices to construct a tailored sub-model for each straggler based on client performance profiling. Furthermore, FLuID can dynamically adapt to changes in stragglers as runtime conditions shift. We evaluate FLuID using five real-world mobile clients. The evaluations show that Invariant Dropout maintains baseline model efficiency while alleviating the performance bottleneck of stragglers through a dynamic, runtime approach.
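A minimal sketch of threshold-based sub-model extraction in the spirit of Invariant Dropout follows, assuming per-neuron update magnitudes decide which neurons a straggler may drop; the names, shapes, and keep-fraction are illustrative, not FLuID's implementation.

```python
# Sketch: drop the neurons whose recent weight updates are smallest
# ("invariant"), keeping the most-changing neurons for stragglers.

import numpy as np

def invariant_dropout_mask(old_w, new_w, keep_fraction):
    """Return a boolean keep-mask over neurons (rows of the weight matrix)."""
    update = np.abs(new_w - old_w).sum(axis=1)   # per-neuron update size
    k = max(1, int(keep_fraction * len(update)))
    threshold = np.sort(update)[-k]
    return update >= threshold

rng = np.random.default_rng(1)
w0 = rng.normal(size=(10, 5))
w1 = w0 + rng.normal(scale=0.1, size=w0.shape)
mask = invariant_dropout_mask(w0, w1, keep_fraction=0.6)
print(mask.sum(), "of", len(mask), "neurons kept")
```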
QCCC
Efficient QAOA Optimization Using Directed Restarts and Graph Lookup
Variational Quantum Algorithms (VQAs) aim to enhance the capabilities of Noisy Intermediate-Scale Quantum (NISQ) devices. These algorithms utilize parameterized circuits and classical optimizers to iteratively execute circuits with varying parameters. However, VQAs face computational overheads due to repeated iterations and random restarts. Prior work suggests using basic sub-graphs to transfer parameters for the input graph, reducing optimizer overheads but limiting applicability to structured, regular graphs. In real-world applications, random irregular graphs are common, and existing methods are neither scalable nor practical for such graphs. This paper presents a framework that aims to improve VQA parameter optimization for random irregular graphs. The framework uses graph similarity and important features such as total edge counts, average edge counts, and variance. It follows an iterative process to choose basis sub-graphs from a small database and adjust parameters accordingly. Classical optimizers then utilize these parameters to determine when to restart and perform gradient descent. This approach increases the chances of reaching the global maximum.
The advent of High-Performance Computing has led to the adoption of Convolutional Neural Networks (CNNs) in safety-critical applications such as autonomous vehicles. However, CNNs are vulnerable to DRAM errors corrupting their parameters, thereby degrading their accuracy. Existing techniques for protecting CNNs from DRAM errors are either expensive or fail to protect against large-granularity, multi-bit errors, which occur commonly in DRAMs. We propose a software-implemented coding scheme, Structural Coding (SC), for protecting CNNs from large-granularity memory errors. SC achieves a three-orders-of-magnitude reduction in the Silent Data Corruption (SDC) rates of CNNs compared to no protection. Its average error correction coverage is also significantly higher than that of other software techniques for protecting CNNs from memory faults. Further, its average performance, memory, and energy overheads are 3%, 15.71%, and 4.38%, respectively. These overheads are much lower than those of other software protection techniques.
SC-W
Enabling Scalable VQE Simulation on Leading HPC Systems
Meng Wang, Fei Hua, Chenxu Liu, Nicholas Bauman, Karol Kowalski, Daniel Claudino, Travis Humble, Prashant J. Nair, and Ang Li
In the Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Oct 2023
Large-scale simulations of quantum circuits pose significant challenges, especially in quantum chemistry, due to the number of qubits, circuit depth, and the number of circuits needed per problem. High-performance computing (HPC) systems offer massive computational capabilities that could help overcome these obstacles. We developed a high-performance quantum circuit simulator called NWQ-Sim, and demonstrated its capability to simulate large quantum chemistry problems on NERSC’s Perlmutter supercomputer. Integrating NWQ-Sim with XACC, an open-source programming framework for quantum-classical applications, we have executed quantum phase estimation (QPE) and variational quantum eigensolver (VQE) algorithms for downfolded quantum chemistry systems at unprecedented scales. Our work demonstrates the potential of leveraging HPC resources and optimized simulators to advance quantum chemistry and other applications of near-term quantum devices. By scaling to larger qubit counts and circuit depths, high-performance simulators like NWQ-Sim will be critical for characterizing and validating quantum algorithms before their deployment on actual quantum hardware.
As Dynamic Random Access Memories (DRAM) scale, they are becoming increasingly susceptible to Row Hammer. By rapidly activating rows of DRAM cells (aggressor rows), attackers can exploit inter-cell interference through Row Hammer to flip bits in neighboring rows (victim rows). A recent work, called Randomized Row-Swap (RRS), proposed proactively swapping aggressor rows with randomly selected rows before an aggressor row can cause Row Hammer. RRS promises several years of protection against Row Hammer attacks. Our paper observes that RRS is neither secure nor scalable. We first propose the ’Juggernaut attack pattern’ that breaks RRS in under 1 day. Juggernaut exploits the fact that the mitigative action of RRS, a swap operation, can itself induce additional target row activations, defeating such a defense. Second, this paper proposes a new defense Secure Row-Swap mechanism that avoids the additional activations from swap (and unswap) operations and protects against Juggernaut. Furthermore, this paper extends Secure Row-Swap with attack detection to defend against even future attacks. While this provides better security, it also allows reducing the frequency of swaps securely, thereby enabling Scalable and Secure Row-Swaps. Our analysis shows that the Scalable and Secure Row-Swap mechanism provides years of Row Hammer protection with 3.3x lower storage overheads as compared to the RRS design. Additionally, it incurs only a 0.7% slowdown as compared to a not-secure baseline for a Row Hammer threshold of 1.2K.
In the Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Volume 2, Oct 2023
Deep learning models are a valuable “secret sauce” that confers a significant competitive advantage. Many models are never visible to the user and even publicly known state-of-the-art models are either completely proprietary or only accessible via access-controlled APIs. Increasingly, these models run directly on the edge, often using a low-power DNN accelerator. This makes models particularly vulnerable, as an attacker with physical access can exploit side channels like off-chip memory access volumes. Indeed, prior work has shown that this channel can be used to steal dense DNNs from edge devices by correlating data transfer volumes with layer geometry. Unfortunately, prior techniques become intractable when the model is sparse in either weights or activations because off-chip transfers no longer correspond exactly to layer dimensions. Could it be that the many mobile-class sparse accelerators are inherently safe from this style of attack? In this paper, we show that it is feasible to steal a pruned DNN model architecture from a mobile-class sparse accelerator using the DRAM access volume channel. We describe HuffDuff, an attack scheme with two novel techniques that leverage (i) the boundary effect present in CONV layers, and (ii) the timing side channel of on-the-fly activation compression. Together, these techniques dramatically reduce the space of possible model architectures by up to 94 orders of magnitude, resulting in fewer than 100 candidate models, a number that can be feasibly tested. Finally, we sample network instances from our solution space and show that (i) our solutions reach the victim accuracy under the iso-footprint constraint, and (ii) significantly improve black-box targeted attack success rates.
Rowhammer allows an attacker to induce bit flips in a row by rapidly accessing neighboring rows. Rowhammer is a severe security threat as it can be used to escalate privilege or break confidentiality. Moreover, the threshold of activations needed to induce Rowhammer continues to decrease, and new attacks like Half-Double break existing solutions that refresh victim rows. The recently proposed Randomized Row-Swap (RRS) scheme is resilient to Half-Double as it provides mitigation by swapping an aggressor row with a random row. However, to ensure security, the threshold for triggering a row-swap must be set much lower than the Rowhammer threshold, leading to a significant performance loss of 20% on average at a Rowhammer threshold of 1K. Furthermore, the SRAM overhead for storing the indirection table of RRS becomes prohibitively large – 2.4MB per rank at a Rowhammer threshold of 1K. Our goal is to develop a scalable Rowhammer mitigation that incurs negligible performance and storage overheads.
To this end, we propose AQUA, a Rowhammer mitigation that breaks the spatial correlation between aggressor and victim rows by dynamically quarantining the aggressor row in a dedicated region of memory. AQUA allows for an effective row migration threshold much higher than in RRS, leading to an order of magnitude less slowdown and SRAM. As the security of AQUA is not reliant on keeping the destination row a secret, we further reduce the SRAM overheads of the indirection table by storing it in DRAM, and accessing it on-demand. We derive the size of the quarantine region required to ensure security for AQUA and show that reserving about 1% of DRAM is sufficient to mitigate Rowhammer at a threshold of 1K. Our evaluations show that AQUA incurs an average slowdown of 2% and an SRAM overhead (for mapping and migration) of only 41KB per rank at a Rowhammer threshold of 1K.
DRAM systems continue to be plagued by the Row-Hammer (RH) security vulnerability. The threshold number of row activations (TRH) required to induce RH has reduced rapidly from 139K in 2014 to 4.8K in 2020, and TRH is expected to reduce further, making RH even more severe for future DRAM. Therefore, solutions for mitigating RH should be effective not only at current TRH but also at future TRH. In this paper, we investigate the mitigation of RH at ultra-low thresholds (500 and below). At such thresholds, state-of-the-art solutions, which rely on SRAM or CAM for tracking row activations, incur impractical storage overheads (340KB or more per rank at a TRH of 500), making such solutions unappealing for commercial adoption. Alternative solutions, which store per-row metadata in the addressable DRAM space, incur significant slowdown (25% on average) due to extra memory accesses, even in the presence of metadata caches. Our goal is to develop scalable RH mitigation while incurring low SRAM and performance overheads.
To that end, this paper proposes Hydra, a Hybrid Tracker for RH mitigation, which combines the best of both SRAM and DRAM to enable low-cost mitigation of RH at ultra-low thresholds. Hydra consists of two structures. First, an SRAM-based structure that tracks aggregated counts at the granularity of a group of rows, which is sufficient for the vast majority of rows that receive only a few activations. Second, a per-row tracker stored in the DRAM array, which can track an arbitrary number of rows; however, to limit performance overheads, this tracker is used only for the small number of rows that exceed the tracking capability of the SRAM-based structure. We provide a security analysis of Hydra to show that Hydra can reliably issue a mitigation within the specified threshold. Our evaluations show that Hydra enables robust mitigation of RH, while incurring an SRAM overhead of only 28KB per rank and an average slowdown of only 0.7% (at a TRH of 500).
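A toy sketch of the two-level tracking idea follows; the group size and spill threshold are made-up values, and Python dictionaries stand in for the SRAM and DRAM structures.

```python
# Sketch of a hybrid activation tracker in the spirit of Hydra: a cheap
# group-level counter first, per-row counts only once a group gets hot.

from collections import defaultdict

GROUP_SIZE = 128            # rows aggregated per group counter
GROUP_SPILL_THRESHOLD = 64  # after this many activations, track per-row

group_count = defaultdict(int)   # models the SRAM structure
row_count = defaultdict(int)     # models the per-row, DRAM-resident tracker

def on_activation(row: int) -> int:
    """Return the best-known activation count for this row."""
    g = row // GROUP_SIZE
    if group_count[g] < GROUP_SPILL_THRESHOLD:
        group_count[g] += 1          # cheap common case: group counter only
        return group_count[g]
    row_count[row] += 1              # rare hot case: precise per-row count
    return row_count[row]

for _ in range(200):
    on_activation(5)
print(group_count[0], row_count[5])  # 64 group-level, 136 per-row
```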
Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items’ and users’ categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper dives deep into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
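A small sketch of skew-driven placement follows; the access log and hotness threshold are illustrative assumptions, not FAE's profiling pipeline.

```python
# Sketch: classify embedding entries as hot (GPU-HBM-resident) or cold
# (CPU-memory-resident) from observed access counts.

from collections import Counter

def classify_embeddings(access_log, hot_threshold):
    counts = Counter(access_log)
    return {e for e, c in counts.items() if c >= hot_threshold}

log = [7, 7, 7, 7, 3, 7, 3, 9]      # embedding IDs seen in training inputs
print(classify_embeddings(log, hot_threshold=4))  # {7} placed in HBM
```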
Row-Hammer (RH) is a fault-injection attack that occurs when a DRAM row is accessed frequently causing bit-flips in nearby rows. RH is a serious security threat as attackers can cause bit-flips in page-tables to achieve privilege escalation. Several defenses have been recently proposed that track attacker-controlled aggressor-rows and apply mitigating action on immediate neighboring victim rows by refreshing them. However, all such proposals using victim-focused mitigation preserve the spatial connection between victim and aggressor rows, and are hence susceptible to access patterns causing bit-flips in rows beyond the immediate neighbor. For example, the recently disclosed Half-Double attack causes bit-flips in the presence of existing victim-focused mitigation. We propose Randomized Row-Swap (RRS), a novel mitigation action that breaks the spatial connection between the aggressor and victim rows, thus providing robust defense against even complex RH access patterns. RRS is an aggressor-focused mitigation that periodically swaps aggressor-rows with other randomly selected rows in memory, to limit the possible damage in any one locality. While RRS can be used with any tracking mechanism, we implement it with the recently proposed Misra-Gries tracker and target a RH Threshold of 4K activations (lower than recent attacks). Our evaluations show that RRS has negligible performance and power overheads (less than 0.5% on average) and provides strong security guarantees for avoiding RH bit flips throughout system lifetime.
Row-Hammer (RH) is a DRAM data-disturbance failure that occurs when a row is activated frequently, which causes bit-flips in nearby rows. Row-Hammer is a significant security threat as an attacker can exploit the bit-flips to do privilege escalation and leak confidential data. While several solutions aim to mitigate RH, such solutions depend on the RH threshold and adversarial access patterns. Solutions developed for a given threshold become ineffective for newer devices with lower thresholds, and new attack patterns continue to break existing mitigations. Currently, there is no guaranteed solution for RH, which means that the system remains vulnerable to security threats even in the presence of RH mitigation. In this paper, we contend that simply relying on RH mitigation is insufficient to provide security in the presence of reducing threshold and motivated attackers. We propose SafeGuard, which equips the system with low-cost integrity protection as a defense against potential attacks that break the RH mitigation. As SafeGuard can detect arbitrary failures, it converts the problem of RH bit-flips from a security threat (silent consumption of corrupted data) to a reliability concern (detectable uncorrectable errors caused by integrity violation). We develop SafeGuard for systems that employ ECC modules and show that SafeGuard can provide strong detection to both SECDED (46-bit MAC per cache-line) and Chipkill (32-bit MAC per cache-line) while retaining the correction capability of conventional designs. SafeGuard avoids incurring any storage overheads in DRAM by simply reorganizing the ECC code to operate at a cache-line granularity (64 bytes) instead of a word granularity (8 bytes). Our evaluations show that SafeGuard has a negligible impact on both the system performance (0.7%) and the system reliability due to naturally occurring errors while still providing a strong defense against the security risk of RH by detecting arbitrary bit-flips.
The popularity of Deep Neural Networks (DNNs) has led to many DNN accelerator architectures, which typically focus on the on-chip storage and computation costs. However, much of the energy is spent on accesses to off-chip DRAM memory. While emerging resistive memory technologies such as MRAM, PCM, and RRAM can potentially reduce this energy component, they suffer from drawbacks such as low endurance that prevent them from being a DRAM replacement in DNN applications. In this paper, we examine how DNN accelerators can be designed to overcome these limitations and how emerging memories can be used for off-chip storage. We demonstrate that through (a) careful mapping of DNN computation to the accelerator and (b) a hybrid setup (both DRAM and an emerging memory), we can reduce inference energy over a DRAM-only design by a factor ranging from 1.12× on EfficientNetB7 to 6.3× on ResNet-50, while also increasing the endurance from 2 weeks to over a decade. As the energy benefits vary dramatically across DNN models, we also develop a simple analytical heuristic solely based on DNN model parameters that predicts the suitability of a given DNN for emerging-memory-based accelerators.
In this paper, we identify a previously untapped source of compressibility in cache working sets: clusters of cachelines that are similar, but not identical, to one another. To compress the cache, we can then store the "clusteroid" of each cluster together with the (much smaller) "diffs" needed to reconstruct the rest of the cluster. To exploit this opportunity, we propose a hardware-level on-line cacheline clustering mechanism based on locality-sensitive hashing. Our method dynamically forms clusters as they appear in the data access stream and retires them as they disappear from the cache. Our evaluations show that we achieve 2.25x compression on average (and up to 9.9x) on the SPEC CPU 2017 suite, significantly higher than prior proposals scaled to an iso-silicon budget.
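The clusteroid-plus-diff idea can be pictured with a toy sketch: a crude locality-sensitive hash groups near-identical lines, and each member stores only its byte-level differences from the clusteroid. The byte-sampling hash is an illustrative stand-in, not the paper's hardware hash.

```python
# Sketch of similarity-based cacheline compression: cluster by a toy LSH,
# then store (clusteroid, diff) pairs instead of full lines.

def lsh_bucket(line: bytes, stride: int = 8) -> bytes:
    """Sample every stride-th byte; similar lines usually collide."""
    return bytes(line[::stride])

def diff(clusteroid: bytes, line: bytes):
    """Positions and bytes where line differs from the clusteroid."""
    return [(i, b) for i, (a, b) in enumerate(zip(clusteroid, line)) if a != b]

base = bytes(range(64))
near = bytearray(base)
near[10] = 0xFF                              # similar, but not identical
assert lsh_bucket(base) == lsh_bucket(bytes(near))
print(diff(base, bytes(near)))               # [(10, 255)] -> tiny diff stored
```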
The importance of caches for performance, and their high silicon area cost, have motivated hardware solutions that transparently compress the cached data to increase effective capacity without sacrificing silicon area. To this end, prior work has taken one of two approaches: either (a) deduplicating identical cache blocks across the cache to take advantage of inter-block redundancy or (b) compressing common patterns within each cache block to take advantage of intra-block redundancy. In this paper, we demonstrate that leveraging only one of these redundancy types leads to a significant loss in compression opportunities for several applications: some workloads exhibit either inter-block or intra-block redundancy, while others exhibit both. We propose 2DCC (Two-Dimensional Cache Compression), a simple technique that takes advantage of both types of redundancy. Across the SPEC and Parsec benchmark suites, 2DCC results in a 2.12x compression factor (geomean) compared to 1.44–1.49x for the best prior techniques on an iso-silicon basis. For the cache-sensitive subset of these benchmarks run in isolation, 2DCC also achieves an 11.7% speedup (geomean).
ICCD
ADAM: Adaptive Block Placement with Metadata Embedding for Hybrid Caches
Spin-Transfer Torque Random Access Memory (STT-RAM) is a potential alternative to SRAM-based on-chip caches. STT-RAM offers high density and low leakage power, and can therefore be used to build large-capacity last-level caches (LLCs). Unfortunately, the write latency of STT-RAM is significantly longer, and its write energy is considerably higher, compared to SRAM. To mitigate these concerns, researchers have proposed hybrid caches comprised of SRAM and STT-RAM regions. In such hybrid caches, an intelligent block placement policy is necessary to store as many write-intensive blocks as possible in the SRAM region. This paper proposes an adaptive block placement framework with metadata embedding (ADAM) for hybrid caches. ADAM embeds metadata (i.e., write-intensity) into a cache block when it is evicted from the LLC. When a cache block is brought in from main memory, the metadata embedded in the block is extracted and used to determine its write-intensity. Our evaluation shows that ADAM can improve performance by 26% (on average) over a baseline block placement scheme.
Compression is seen as a simple technique to increase effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict cache block placement to only include neighboring addresses. Ideally, we should be able to place compressed cache blocks without any restrictions or overheads. This paper proposes Touché, a framework for storing multiple compressed blocks from arbitrary addresses within a cacheline without tag area overheads. The Touché framework consists of three components. The first component, called the "Signature" (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positives). The second component, called the "Tag Appended Data" (TADA) mechanism, stores the full tag addresses with the data. TADA enables Touché to detect false-positive signature matches by providing the full tag address. The third component, called the "Superblock Marker" (SMARK) mechanism, uses a unique marker in the tag entry to indicate compressed cache blocks from neighboring physical addresses in the same cacheline. Touché is hardware-based and achieves an average speedup of 12% (ideal 13%) when compared to an uncompressed baseline.
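A toy model of the signature-then-verify flow follows. The signature width, hash, and tag values are arbitrary illustrations; Touché's actual signature construction is a hardware design.

```python
# Sketch of signature-based tag matching: short signatures filter most
# mismatches, and the full tag stored with data (TADA) catches the rare
# false positive.

import hashlib

SIG_BITS = 9  # shortened signature width, chosen only for illustration

def signature(tag: int) -> int:
    digest = hashlib.sha256(tag.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:2], "little") % (1 << SIG_BITS)

def lookup(stored_tag: int, request_tag: int) -> str:
    if signature(request_tag) != signature(stored_tag):
        return "miss"  # filtered by the signature; no data access needed
    # Signature hit: fetch the line and verify with the full tag (TADA).
    return "hit" if stored_tag == request_tag else "false-positive"

assert lookup(0x1234, 0x1234) == "hit"
probes = [lookup(0x1234, t) for t in range(1, 1000)]
print(probes.count("false-positive"), "false positives in", len(probes), "probes")
```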
Existing and near-term quantum computers face significant reliability challenges because of high error rates caused by noise. Such machines are operated in the Noisy Intermediate-Scale Quantum (NISQ) model of computing. As NISQ machines exhibit high error rates, only programs that require a few qubits can be executed reliably. Therefore, NISQ machines tend to underutilize their resources. In this paper, we propose to improve the throughput and utilization of NISQ machines by using multi-programming and enabling the NISQ machine to concurrently execute multiple workloads. Multi-programming a NISQ machine is non-trivial. This is because a multi-programmed NISQ machine can have an adverse impact on the reliability of the individual workloads. To enable multi-programming in a robust manner, we propose three solutions. First, we develop methods to partition the qubits into multiple reliable regions using error information from machine calibration so that each program can have a fair allocation of reliable qubits. Second, we observe that when two programs are of unequal lengths, measurement operations can impact the reliability of the co-running program. To reduce this interference, we propose a Delayed Instruction Scheduling (DIS) policy that delays the start of the shorter program so that all the measurement operations can be performed at the end. Third, we develop an Adaptive Multi-Programming (AMP) design that monitors the reliability at runtime and reverts to single-program mode if the reliability impact of multi-programming is greater than a predefined threshold. Our evaluations with IBM-Q16 show that our proposals can improve resource utilization and throughput by up to 2x, while limiting the impact on reliability.
Conventionally, systems have relied on technology scaling to provide smaller cells, which helps in increasing the capacity of on-chip and off-chip structures. Unfortunately, scaling technology to smaller nodes causes increased susceptibility to faults. We study the problem of efficiently tolerating transient failures using scalable Spin-Transfer Torque RAM (STTRAM) as an example. At smaller feature sizes, the energy required to flip an STTRAM cell reduces, which makes these cells more susceptible to random failures caused by thermal noise. Such failures can be tolerated by periodic scrubbing and provisioning each line with an Error Correction Code (ECC). However, to tolerate the desired bit-error rate, the cache needs ECC-6 (six-bit error correction) per line, incurring impractical storage overheads. Ideally, we want to tolerate these faults without relying on multi-bit ECC. We propose SuDoku, a design that provisions each line with ECC-1 and a strong error detection code, and relies on a region-based RAID-4 to perform correction of multi-bit errors. Unfortunately, simply having such a RAID-4 based architecture is ineffective at tolerating a high rate of transient faults and provides an MTTF on the order of only a few seconds. We describe a novel data resurrection scheme that can repair multiple faulty lines in a RAID-4 region to increase the MTTF to several hours. We propose an extension of SuDoku, which hashes a given line into two regions of RAID-4 to significantly enhance reliability and increase the MTTF to trillions of hours. Our evaluations show that SuDoku provides 874x higher reliability than ECC-6, incurs 30% less storage than ECC-6, and performs within 0.1% of an ideal fault-free baseline.
Memory systems are becoming bandwidth constrained, and data compression is seen as a simple technique to increase their effective bandwidth. However, data compression requires accessing Metadata, which incurs additional bandwidth overheads. Even after using a Metadata-Cache, the bandwidth overheads of Metadata can reduce the benefits of compression. This paper proposes Attaché, a framework that reduces the overheads of Metadata accesses. The Attaché framework consists of two components. The first component, called the Blended Metadata Engine (BLEM), enables data and its Metadata to be accessed together. BLEM incurs additional Metadata accesses only 0.003% of the time and removes almost all Metadata bandwidth overheads. The second component, called the Compression Predictor (COPR), predicts if the memory block is compressed. The COPR predictor uses a fine-grained line-level predictor, a coarse-grained page-level predictor, and a global indicator. This enables Attaché to predict the compressibility of the memory block before sending a memory read request. We implement Attaché on a memory system that uses Sub-Ranking. On average, Attaché achieves a 15.3% speedup (ideal 17%) and saves 22% energy consumption (ideal 23%) when compared to a baseline system that does not employ data compression. Attaché is completely hardware-based and uses only 368KB of SRAM.
Securing off-chip main memory is essential for protection from adversaries with physical access to systems. However, current secure-memory designs incur considerable performance overheads, a major cause being the multiple memory accesses required for traversing an integrity tree, which provides protection against man-in-the-middle attacks or replay attacks. In this paper, we provide a scalable solution to this problem by proposing a compact integrity-tree design that requires fewer memory accesses for its traversal. We enable this by proposing new storage-efficient representations for the counters used for encryption and the integrity tree in secure memories. Our Morphable Counters are more cacheable on-chip, as they provide more counters per cacheline than existing split counters. Additionally, they incur lower overheads due to counter overflows, by dynamically switching between counter representations based on usage patterns. We show that using Morphable Counters enables a 128-ary integrity tree, which can improve performance by 6.3% on average (up to 28.3%) and reduce system energy-delay product by 8.8% on average, compared to an aggressive baseline using split counters with a 64-ary integrity tree. These benefits come without any additional storage or reduction in security and are derived from our compact counter representation, which reduces the integrity-tree size for a 16GB memory from 4MB in the baseline to 1MB. Compared to the recently proposed VAULT, our design provides a speedup of 13.5% on average (up to 47.4%).
Building trusted data-centers requires resilient memories which are protected from both adversarial attacks and errors. Unfortunately, the state-of-the-art memory security solutions incur considerable performance overheads due to accesses for security metadata like Message Authentication Codes (MACs). At the same time, commercial secure memory solutions tend to be designed oblivious to the presence of memory reliability mechanisms (such as ECC-DIMMs), that provide tolerance to memory errors. Fortunately, ECC-DIMMs possess an additional chip for providing error correction codes (ECC), that is accessed in parallel with data, which can be harnessed for security optimizations. If we can re-purpose the ECC-chip to store some metadata useful for security and reliability, it can prove beneficial to both. To this end, this paper proposes Synergy, a reliability-security co-design that improves performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs. Synergy uses the insight that MACs being capable of detecting data tampering are also useful for detecting memory errors. Therefore, MACs are best suited for being placed inside the ECC chip, to be accessed in parallel with each data access. By co-locating MAC and Data, Synergy is able to avoid a separate memory access for MAC and thereby reduce the overall memory traffic for secure memory systems. Furthermore, Synergy is able to tolerate 1 chip failure out of 9 chips by using a parity that is constructed over 9 chips (8 Data and 1 MAC), which is used for reconstructing the data of the failed chip. For memory intensive workloads, Synergy provides a speedup of 20% and reduces system Energy Delay Product by 31% compared to a secure memory baseline with ECC-DIMMs. At the same time, Synergy increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability. Synergy uses commercial ECC-DIMMs and does not incur any additional hardware overheads or reduction of security.
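The RAID-3-style recovery described above can be demonstrated in a few lines: parity across nine chips (8 data + 1 MAC) lets the system rebuild any single failed chip. Byte strings stand in for chip contents; this is an illustration of the parity math, not Synergy's hardware.

```python
# Sketch: XOR parity over 9 chips reconstructs any one failed chip.

def parity(chunks):
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

chips = [bytes([i] * 8) for i in range(9)]   # 8 data chips + 1 MAC chip
p = parity(chips)

failed = 3                                   # say chip 3 dies
survivors = [c for i, c in enumerate(chips) if i != failed]
rebuilt = parity(survivors + [p])            # XOR of survivors and parity
assert rebuilt == chips[failed]
```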
arXiv
LISA: Increasing Internal Connectivity in DRAM for Fast Data Movement and Low Latency
This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA), which was published in HPCA 2016, and examines the work’s significance and future potential. Contemporary systems perform bulk data movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel results in high latency and energy consumption. Prior work proposes to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA’s three applications significantly improves performance and memory energy efficiency on a variety of workloads and system configurations.
2017
Thesis
Architectural Techniques to Enable Reliable and Scalable Memory Systems
Prashant J. Nair
Ph.D. Thesis (arXiv preprint: 1704.03991), Feb 2017
High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen as the key impediment towards using high-density devices, adopting new technologies, and even building the next Exascale supercomputer. To ensure even a bare-minimum level of reliability, present-day solutions tend to have high performance, power and area overheads. Ideally, we would like memory systems to remain robust, scalable, and implementable while keeping the overheads to a minimum. This dissertation describes how simple cross-layer architectural techniques can provide orders of magnitude higher reliability and enable seamless scalability for memory systems while incurring negligible overheads.
This paper investigates compression for DRAM caches. As the capacity of a DRAM cache is typically large, prior techniques on cache compression, which solely focus on improving cache capacity, provide only a marginal benefit. We show that more performance benefit can be obtained if the compression of the DRAM cache is tailored to provide higher bandwidth. If a DRAM cache can provide two compressed lines in a single access, and both lines are useful, the effective bandwidth of the DRAM cache would double. Unfortunately, it is not straightforward to compress DRAM caches for bandwidth. The typically used Traditional Set Indexing (TSI) maps consecutive lines to consecutive sets, so the multiple compressed lines obtained from a set are from spatially distant locations and unlikely to be used within a short period of each other. We can change the indexing of the cache to place consecutive lines in the same set to improve bandwidth; however, when the data is incompressible, such spatial indexing reduces effective capacity and causes significant slowdown. Ideally, we would like to have spatial indexing when the data is compressible and TSI otherwise. To this end, we propose Dynamic-Indexing Cache comprEssion (DICE), a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data. We also propose low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access, in order to avoid probing both indices for retrieving a given cache line. Our studies with a 1GB DRAM cache, on a wide range of workloads (including SPEC and Graph), show that DICE improves performance by 19.0% and reduces energy-delay product by 36% on average. DICE is within 3% of a design that has double the capacity and double the bandwidth. DICE incurs a storage overhead of less than 1KB and does not rely on any OS support.
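The contrast between the two indexing schemes can be seen in a short sketch; the set count and packing factor are arbitrary illustrations, not DICE's configuration.

```python
# Sketch: traditional set indexing (TSI) spreads consecutive lines over
# consecutive sets, while spatial indexing co-locates neighbors so one
# access can return multiple useful compressed lines.

SETS_LOG2 = 10
LINES_PER_SET = 2   # compressed neighbors packed per set under spatial idx

def tsi_index(line_addr: int) -> int:
    return line_addr % (1 << SETS_LOG2)

def spatial_index(line_addr: int) -> int:
    return (line_addr // LINES_PER_SET) % (1 << SETS_LOG2)

a = 0x4000
print(tsi_index(a), tsi_index(a + 1))          # adjacent lines, different sets
print(spatial_index(a), spatial_index(a + 1))  # adjacent lines, same set
```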
A quantum computer consists of quantum bits (qubits) and a control processor that acts as an interface between the programmer and the qubits. As qubits are very sensitive to noise, they rely on continuous error correction to maintain the correct state. Current proposals rely on software-managed error correction and require large instruction bandwidth, which must scale in proportion to the number of qubits. While such a design may be reasonable for small-scale quantum computers, we show that instruction bandwidth tends to become a critical bottleneck for scaling quantum computers. In this paper, we show that 99.999% of the instructions in the instruction stream of a typical quantum workload stem from error correction. Using this observation, we propose QuEST (Quantum Error-Correction Substrate), an architecture that delegates the task of quantum error correction to the hardware. QuEST uses a dedicated programmable micro-coded engine to continuously replay the instruction stream associated with error correction. The instruction bandwidth requirement of QuEST scales in proportion to the number of active qubits (typically << 0.1%) rather than the total number of qubits. We analyze the effectiveness of QuEST with area and thermal constraints and propose a scalable microarchitecture using typical Quantum Error Correction Code (QECC) execution patterns. Our evaluations show that QuEST reduces the instruction bandwidth demand of several key workloads by five orders of magnitude while ensuring deterministic instruction delivery. Apart from error correction, we also observe a large instruction bandwidth requirement for fault-tolerant quantum instructions (magic state distillation). We extend QuEST to manage these instructions in hardware and provide additional reductions in bandwidth. With QuEST, we reduce the total instruction bandwidth by eight orders of magnitude.
Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. For maintaining compatibility with memory standards, On-Die ECC is kept invisible to the memory controller. This paper explores how to design high-reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefit compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or consuming bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined "catch-word" instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the ninth chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172x higher than SECDED), while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill-level reliability while avoiding the associated storage, performance, and power overheads.
TACO
Citadel: Efficiently Protecting Stacked Memory from TSV and Large Granularity Failures
Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (e.g., due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data-striping reduces memory-level parallelism, causing significant slowdown and higher power consumption. This article proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank. By retaining cache lines within banks, Citadel enables a high-performance and low-power memory system and also efficiently protects the stacked memory system from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data-TSVs and faulty address-TSVs; Tri-Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions either at a row granularity or at a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, and yet provides 700x higher reliability than ChipKill-like ECC codes.
This paper introduces a new DRAM design that enables fast and energy-efficient bulk data movement across subarrays in a DRAM chip. While bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and energy consumption. Prior work proposed to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. We propose a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling an array of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA’s three applications significantly improves performance and memory energy efficiency, and their combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.
Phase Change Memory (PCM) is an emerging Non-Volatile Memory (NVM) technology that has the potential to provide scalable high-density memory systems. While the non-volatility of PCM is a desirable property for saving leakage power, it also has the undesirable effect of making PCM main memories susceptible to new modes of security vulnerabilities, for example, accessibility to sensitive data if a PCM DIMM gets stolen. PCM memories can be made secure by encrypting the data. Unfortunately, such encryption comes with a significant overhead in terms of bits written to PCM memory: half of the bits in the line change on every write, even if the actual number of bits being written to memory is small. Our studies show that a typical writeback modifies, on average, only 12% of the bits in the cacheline. Thus, encryption causes almost a 4x increase in the number of bits written to PCM memories. Such extraneous bit writes cause a significant increase in write power, reduction in write endurance, and reduction in write bandwidth. To provide the benefits of secure memory in a write-efficient manner, this paper proposes Dual Counter Encryption (DEUCE). DEUCE is based on the observation that a typical writeback only changes a few words, so DEUCE re-encrypts only the words that have changed. We show that DEUCE reduces the number of modified bits per writeback for a secure memory from 50% to 24%, which improves performance by 27% and increases lifetime by 2x.
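The dual-counter idea can be sketched as counter-mode encryption with a leading and a trailing counter per line, where only the words written in the current epoch carry the leading counter. The hash-based toy keystream, 16-word line, and omitted epoch reset below are simplifying assumptions, not the paper's AES-based design.

```python
# Sketch of the DEUCE mechanism: per-line leading/trailing counters, with
# only the words modified in the current epoch re-encrypted under the
# leading counter, so few ciphertext bits flip per writeback.
import hashlib

WORDS_PER_LINE = 16

def keystream(key, addr, counter, word_idx):
    # One-time-pad word derived from (key, address, counter, word index).
    h = hashlib.sha256(f"{key}|{addr}|{counter}|{word_idx}".encode()).digest()
    return int.from_bytes(h[:4], "big")

class DeuceLine:
    def __init__(self, key, addr):
        self.key, self.addr = key, addr
        self.leading = 0    # counter for words written this epoch
        self.trailing = 0   # counter for untouched words
        self.modified = [False] * WORDS_PER_LINE
        # Line starts as encrypted zeros under the trailing counter.
        self.ct = [keystream(key, addr, 0, i) for i in range(WORDS_PER_LINE)]

    def write(self, plaintext):
        old = self.read()
        self.leading += 1
        for i in range(WORDS_PER_LINE):
            if plaintext[i] != old[i]:
                self.modified[i] = True
            if self.modified[i]:
                # Re-encrypt only words touched this epoch; clean words
                # keep their old ciphertext.
                self.ct[i] = plaintext[i] ^ keystream(self.key, self.addr,
                                                      self.leading, i)
        # A real design periodically re-encrypts the whole line and sets
        # trailing = leading (a new epoch); omitted for brevity.

    def read(self):
        return [self.ct[i] ^ keystream(self.key, self.addr,
                                       self.leading if self.modified[i]
                                       else self.trailing, i)
                for i in range(WORDS_PER_LINE)]

line = DeuceLine("k0", addr=0x80)
line.write([7] + [0] * 15)       # only word 0 is re-encrypted
assert line.read()[0] == 7
```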
Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to reduce DRAM refresh overheads. Such techniques rely on accurate profiling of retention times of cells, and perform faster refresh only for the few rows that have cells with low retention times. Unfortunately, retention times of some cells can change at runtime due to Variable Retention Time (VRT), which makes it impractical to reliably deploy multirate refresh. Based on experimental data from 24 DRAM chips, we develop architecture-level models for analyzing the impact of VRT. We show that simply relying on ECC DIMMs to correct VRT failures is untenable, as it results in a data error once every few months. We propose AVATAR, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures. AVATAR provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.
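The adaptive policy amounts to a small amount of bookkeeping: profiled-weak rows start in the fast-refresh bin, and any row that later shows an ECC-corrected error (a likely VRT event) is upgraded permanently. The refresh periods, naming, and ECC hook below are illustrative assumptions.

```python
# Sketch of an AVATAR-style adaptive multirate refresh policy. Rates and
# the ECC-correction callback are assumed for illustration.

FAST_MS, SLOW_MS = 64, 256   # refresh periods (assumed)

class AvatarRefresh:
    def __init__(self, profiled_weak_rows, num_rows):
        # Profiling puts known-weak rows in the fast bin up front.
        self.fast = set(profiled_weak_rows)
        self.num_rows = num_rows

    def period_ms(self, row):
        return FAST_MS if row in self.fast else SLOW_MS

    def on_ecc_correction(self, row):
        # A corrected error hints the row's retention changed (VRT):
        # retire it from slow refresh before a second bit can fail.
        self.fast.add(row)

    def refresh_savings(self):
        # Fraction of refresh operations avoided vs. refreshing all rows fast.
        slow_rows = self.num_rows - len(self.fast)
        return slow_rows * (1 - FAST_MS / SLOW_MS) / self.num_rows
```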
TACO
FaultSim: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems
As memory systems scale, maintaining their Reliability, Availability, and Serviceability (RAS) is becoming more complex. To make matters worse, recent studies of DRAM failures in data centers and supercomputer environments have highlighted that large-granularity failures are common in DRAM chips. Furthermore, the move toward 3D-stacked memories can make the system vulnerable to newer failure modes, such as those occurring from faults in Through-Silicon Vias (TSVs). To architect future systems and to use emerging technology, system designers will need to employ strong error correction and repair techniques. Unfortunately, evaluating the relative effectiveness of these reliability mechanisms is often difficult and is traditionally done with analytical models, which are both error-prone and time-consuming to develop. To this end, this article proposes FaultSim, a fast, configurable memory-reliability simulation tool for 2D and 3D-stacked memory systems. FaultSim employs Monte Carlo simulations, which are driven by real-world failure statistics. We discuss the novel algorithms and data structures used in FaultSim to accelerate the evaluation of different resilience schemes. We implement BCH-1 (SECDED) and ChipKill codes using FaultSim and validate against an analytical model: FaultSim implements BCH-1 and ChipKill codes with a deviation of only 0.032% and 8.41%, respectively, from the analytical model. FaultSim can simulate 1 million Monte Carlo trials (each for a period of 7 years) of BCH-1 and ChipKill codes in only 34 seconds and 33 seconds, respectively.
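In the same spirit (though far simpler than FaultSim's actual algorithms and data structures), a Monte Carlo trial samples faults from per-mode FIT rates over the device lifetime and asks whether the code survives. The FIT values, rank geometry, and one-line ChipKill test below are assumptions for illustration.

```python
# Minimal Monte Carlo fault-injection loop in the spirit of FaultSim.
# FIT rates and the ChipKill model (corrects any single faulty chip per
# 18-chip rank) are illustrative assumptions.
import random

HOURS = 7 * 365 * 24                             # 7-year lifetime
FIT = {"bit": 33.0, "row": 9.5, "bank": 10.0}    # per chip, per 1e9 hours (assumed)
CHIPS_PER_RANK = 18
TRIALS = 100_000

def trial_survives():
    faulty_chips = set()
    for fit in FIT.values():
        lam = fit * 1e-9 * HOURS                 # expected faults per chip
        for chip in range(CHIPS_PER_RANK):
            if random.random() < lam:            # P(>=1 fault) ~ lam, lam small
                faulty_chips.add(chip)
    return len(faulty_chips) <= 1                # ChipKill survives one bad chip

fails = sum(not trial_survives() for _ in range(TRIALS))
print(f"P(uncorrectable within 7 years) ~ {fails / TRIALS:.2e}")
```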
Phase Change Memory (PCM) is an emerging memory technology that can enable scalable high-density main memory systems. Unfortunately, PCM has higher read latency than DRAM, resulting in lower system performance. This paper investigates architectural techniques to improve the read latency of PCM. We observe that there is a wide distribution in cell resistance in both the SET state and the RESET state, and that the read latency of PCM is designed conservatively to handle the worst-case cell. If PCM sensing can be tuned to exploit the variability in cell resistance, then read latency can be reduced. We propose two schemes to enable better-than-worst-case read latency for PCM systems. Our first proposal, Early Read, reads the data earlier than the specified time period. Our key observation that Early Read causes only unidirectional errors (SET being read as RESET) allows us to efficiently detect data errors using Berger codes. In the uncommon case that Early Read causes data error(s), we simply retry the read operation with the original latency. Our evaluations show that Early Read can reduce the read latency by 25% while incurring a storage overhead of only 10 bits per 64-byte line. Our second proposal, Turbo Read, reduces the sensing time for read operations by pumping higher current, at the expense of accidentally switching the PCM cell with small probability during the read operation. We analyze Error Correction Codes (ECC) and Probabilistic Row Scrubbing (PRS) for maintaining data integrity under Turbo Read. We show that a combination of Early Read and Turbo Read can reduce the PCM read latency by 30%, improve the system performance by 21%, and reduce the Energy Delay Product (EDP) by 28%, while requiring minimal changes to the memory system.
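Berger codes make the unidirectional-error check cheap: the stored check symbol is simply the count of zeros in the data. A 1-to-0 flip can only raise the data's zero count while lowering (or preserving) the stored check value, so the two can never re-agree after a unidirectional error. A minimal sketch, with the word width and read callbacks assumed:

```python
# Berger-code check behind Early Read: detect any purely unidirectional
# error pattern (SET cells misread as RESET, i.e. 1 -> 0 flips) and fall
# back to the full-latency read.

def berger_check(bits):
    return bits.count(0)        # check symbol = number of zeros

def early_read(fast_read, slow_read):
    data, check = fast_read()
    if berger_check(data) == check:
        return data             # common case: fast read was correct
    return slow_read()[0]       # uncommon case: retry at full latency

stored = [1, 0, 1, 1] * 128                 # toy 512-bit line, 128 zeros
check = berger_check(stored)
misread = [0] + stored[1:]                  # one SET cell sensed as RESET
assert early_read(lambda: (misread, check),
                  lambda: (stored, check)) == stored
```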
Energy consumption is a primary consideration that determines the usability of emerging mobile computing devices such as smartphones. Refresh operations for main memory account for a significant fraction of the overall energy consumption, especially during idle periods: the processor can be switched off quickly, but memory contents must continue to be refreshed to avoid data loss. Given that mobile devices are idle most of the time, reducing refresh power in idle mode is critical to maximize the duration for which the device remains usable. The frequency of refresh operations in memory can be reduced significantly by using strong multi-bit error correction codes (ECC). Unfortunately, strong ECC codes incur high latency, which causes significant performance degradation (as high as 21%, and on average 10%). To obtain both low refresh power in idle periods and high performance in active periods, this paper proposes Morphable ECC (MECC). During idle periods, MECC keeps the memory protected with 6-bit ECC (ECC-6) and employs a refresh period of 1 second, instead of the typical refresh period of 64ms. During active operation, MECC reduces the refresh interval to 64ms and converts memory from ECC-6 to a weaker ECC (single-bit error correction) on demand, thus avoiding the high latency of ECC-6 except for the first access during the active mode. Our proposal reduces refresh operations during idle mode by 16x and memory power in idle mode by 2x, while retaining performance within 2% of a system that does not use any ECC.
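The mode switch reduces to a per-line two-state machine plus a global refresh period, sketched below under assumed naming; the actual encoder/decoder hardware is abstracted away.

```python
# Sketch of the Morphable ECC mode switch: idle memory sits under strong
# ECC-6 with a long refresh period; after wakeup, each line is demoted to
# fast single-bit ECC on its first access. Names and structure are
# illustrative assumptions.

REFRESH_IDLE_S, REFRESH_ACTIVE_S = 1.0, 0.064

class MorphableLine:
    def __init__(self):
        self.mode = "ECC6"        # strong but slow-to-decode code

    def on_access_while_active(self):
        if self.mode == "ECC6":
            # Pay the ECC-6 decode once, then re-encode with fast ECC-1.
            self.mode = "ECC1"
        # subsequent accesses see only ECC-1 latency

class MorphableMemory:
    def __init__(self, num_lines):
        self.lines = [MorphableLine() for _ in range(num_lines)]
        self.refresh_period = REFRESH_ACTIVE_S

    def enter_idle(self):
        for ln in self.lines:     # re-encode everything with ECC-6
            ln.mode = "ECC6"
        self.refresh_period = REFRESH_IDLE_S

    def exit_idle(self):
        self.refresh_period = REFRESH_ACTIVE_S  # lines morph on demand
```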
Dynamic Random Access Memory (DRAM) cells rely on periodic refresh operations to maintain data integrity. As the capacity of DRAM memories has increased, so has the amount of time consumed in doing refresh. Refresh operations contend with read operations, which increases read latency and reduces system performance. We show that eliminating the latency penalty due to refresh can improve average performance by 7.2%. However, simply doing intelligent scheduling of refresh operations is ineffective at obtaining significant performance improvement. This article provides an alternative and scalable option to reduce the latency penalty due to refresh. It exploits the property that each refresh operation in a typical DRAM device internally refreshes multiple DRAM rows in JEDEC-based distributed refresh mode. Therefore, a refresh operation has well-defined points at which it can potentially be Paused to service a pending read request. Leveraging this property, we propose Refresh Pausing, a solution that is highly effective at alleviating the contention from refresh operations. It provides an average performance improvement of 5.1% for 8Gb devices and becomes even more effective for future high-density technologies. We also show that Refresh Pausing significantly outperforms the recently proposed Elastic Refresh scheme.
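The scheduling idea fits in a few lines: a refresh command that internally walks several rows exposes a pause point after each row, where pending demand reads preempt it. The rows-per-refresh count and callback names below are assumptions.

```python
# Sketch of Refresh Pausing: per-row pause points inside one refresh
# command let demand reads preempt the remaining refresh work.
from collections import deque

ROWS_PER_REFRESH = 8   # rows covered by one refresh command (assumed)

def refresh_one_row():
    pass               # stand-in for the indivisible per-row refresh work

def run_refresh(pending_reads, service_read):
    """Refresh ROWS_PER_REFRESH rows, pausing at row boundaries for reads."""
    for _ in range(ROWS_PER_REFRESH):
        refresh_one_row()
        while pending_reads:          # pause point: demand reads win
            service_read(pending_reads.popleft())

reads = deque(["rdA", "rdB"])
run_refresh(reads, service_read=print)   # reads drain at the first pause point
```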
CAL
Architectural Support for Mitigating Row Hammering in DRAM Memories
DRAM scaling has been the prime driver of increasing the capacity of main memory systems. Unfortunately, lower technology nodes worsen cell reliability by increasing the coupling between adjacent DRAM cells, thereby exacerbating different failure modes. This paper investigates the reliability problem due to Row Hammering, whereby frequent activations of a given row can cause data loss for its neighboring rows. As DRAM scales to lower technology nodes, the threshold number of row activations that causes data loss for the neighboring rows reduces, making Row Hammering a challenging problem for future DRAM chips. To overcome Row Hammering, we propose two architectural solutions. First, Counter-Based Row Activation (CRA), which uses a counter with each row to count the number of row activations; if the count exceeds the row hammering threshold, a dummy activation is sent to neighboring rows proactively to refresh the data. Second, Probabilistic Row Activation (PRA), which obviates the storage overhead of tracking and simply allows the memory controller to proactively issue dummy activations to neighboring rows with a small probability on every memory access. Our evaluations show that these solutions are effective at mitigating Row Hammering while causing negligible performance loss (less than 1%).
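PRA in particular needs no per-row state, as the sketch below shows. The probability value and the ±1 neighbor model are assumptions, chosen so that an aggressor is overwhelmingly likely to trigger a neighbor refresh long before any realistic hammering threshold.

```python
# Sketch of Probabilistic Row Activation (PRA): on every activation the
# memory controller refreshes the neighbors with small probability p,
# so no tracking storage is needed.
import random

P_REFRESH = 0.001   # assumed per-activation probability

def on_activate(row, refresh_row):
    if random.random() < P_REFRESH:
        refresh_row(row - 1)    # dummy activations refresh the victims
        refresh_row(row + 1)

# With p = 0.001, the chance an aggressor issues 32,000 activations
# without a single neighbor refresh is (1 - 0.001)**32000 ~ 1e-14.
```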
Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (for example, due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data-striping reduces memory-level parallelism, causing significant slowdown and higher power consumption. This paper proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank, thus enabling high performance and low power while efficiently protecting the stacked memory from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data-TSVs and faulty address-TSVs; Tri-Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions either at a row granularity or at a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, and yet provides 700× higher reliability than ChipKill-like ECC codes.
DRAM scaling has been the prime driver for increasing the capacity of main memory systems over the past three decades. Unfortunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller geometries, coupled with the problems of device variation and leakage. Future DRAM devices are likely to experience significantly higher error rates. Techniques that can tolerate errors efficiently can enable DRAM to scale to smaller technology nodes. However, existing techniques such as row/column sparing and ECC become prohibitive at high error rates. To develop cost-effective solutions for tolerating high error rates, this paper advocates a cross-layer approach. Rather than hiding the faulty-cell information within the DRAM chips, we expose it to the architectural level. We propose ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells. ArchShield tolerates these faults using two components: a Fault Map that keeps information about faulty words in a cache line, and Selective Word-Level Replication (SWLR) that replicates faulty words for error resilience. Both the Fault Map and SWLR are integrated in a reserved area of DRAM memory. Our evaluations with an 8GB DRAM DIMM show that ArchShield can efficiently tolerate error rates as high as 10⁻⁴ (100x higher than ECC alone), causes less than 2% performance degradation, and still maintains 1-bit error tolerance against soft errors.
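On a read, the Fault Map steers accesses for the (rare) lines with known-bad words to their replicated copies. The dictionary-based layout and names below are illustrative assumptions, not ArchShield's actual DRAM organization.

```python
# Sketch of an ArchShield-style read path: consult the fault map; only
# lines with faulty words pay the extra access to the replica region.

class ArchShield:
    def __init__(self, faulty_words, replica_region):
        self.fault_map = faulty_words      # line_addr -> set of bad word idxs
        self.replicas = replica_region     # (line_addr, word_idx) -> value

    def read_line(self, addr, raw_words):
        bad = self.fault_map.get(addr)
        if not bad:
            return raw_words               # common case: no extra access
        # Patch faulty words from their replicas in the reserved region.
        return [self.replicas[(addr, i)] if i in bad else w
                for i, w in enumerate(raw_words)]

shield = ArchShield({0x40: {2}}, {(0x40, 2): 0xBEEF})
assert shield.read_line(0x40, [0, 1, 0xDEAD, 3])[2] == 0xBEEF
```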
DRAM cells rely on periodic refresh operations to maintain data integrity. As the capacity of DRAM memories has increased, so has the amount of time consumed in doing refresh. Refresh operations contend with read operations, which increases read latency and reduces system performance. We show that eliminating the latency penalty due to refresh can improve average performance by 7.2%. However, simply doing intelligent scheduling of refresh operations is ineffective at obtaining significant performance improvement. This paper provides an alternative and scalable option to reduce the latency penalty due to refresh. It exploits the property that each refresh operation in a typical DRAM device internally refreshes multiple DRAM rows in JEDEC-based distributed refresh mode. Therefore, a refresh operation has well-defined points at which it can potentially be Paused to service a pending read request. Leveraging this property, we propose Refresh Pausing, a solution that is highly effective at alleviating the contention from refresh operations. It provides an average performance improvement of 5.1% for 8Gb devices, and becomes even more effective for future high-density technologies. We also show that Refresh Pausing significantly outperforms the recently proposed Elastic Refresh scheme.