Resilient Deep Learning Accelerators.
He, Yi.
Record type:
Bibliographic - electronic resource : Monograph/item
Title/Author:
Resilient Deep Learning Accelerators. / He, Yi.
Author:
He, Yi.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2023.
Pagination:
151 p.
Notes:
Source: Dissertations Abstracts International, Volume: 84-12, Section: B.
Contained by:
Dissertations Abstracts International, 84-12B.
Subject:
Computer science.
Electronic resource:
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30486723
ISBN:
9798379704513
Dissertation note:
Thesis (Ph.D.)--The University of Chicago, 2023.
Access restriction:
This item must not be sold to any third party vendors.
Abstract:
Deep learning (DL) accelerators have been widely deployed across a broad range of application domains, from edge computing and self-driving cars to cloud services. Resilience against hardware failures is a top priority for these accelerators, as hardware failures can lead to various undesirable consequences. For inference workloads, hardware failures can cause crashes and significant degradation in inference accuracy. For training workloads, hardware failures can result in failure to converge, low training/test accuracy, numerical errors (e.g., INFs/NaNs), and crashes.

In this thesis, we establish new knowledge about how major classes of hardware failures propagate through and affect deep neural network (DNN) inference and training workloads, and advance the state of the art in resilient DL systems by devising lightweight, effective detection and recovery techniques to mitigate hardware failures. This thesis makes three main contributions.

First, we devise a novel technique for generating high-quality test programs to detect permanent hardware failures (i.e., hardware failures that impose persistent effects, e.g., early-life failures, circuit aging) in the field, for all hardware units in a DL inference accelerator. For hardware units that perform computations, we take advantage of the architectural characteristics of a given DL accelerator: we first adopt ATPG (Automatic Test Pattern Generation) technology, which is routinely used in manufacturing testing, to generate high-quality test patterns, and then reverse-engineer the dataflow/reuse algorithm of the accelerator to map the test inputs to equivalent DNNs that can be executed directly on the accelerator. For hardware units that control data movement and computation, our key observation is that typically only one or a few fixed DNNs are deployed at a time, which allows us to target only the hardware failures that can directly affect these DNNs by executing different layers of the DNNs with carefully crafted weight and input tensors. We demonstrate the efficacy of our technique on Nvidia's open-source accelerator, NVDLA, and show that it detects >99.9% of stuck-at faults and >99.0% of transition faults (two representative fault models for permanent hardware failures) for the entire NVDLA design, compared to only <80% for random test programs. Moreover, our technique requires minimal test time and test storage, and no hardware modification.

Second, we provide (1) FIdelity, an open-source resilience analysis framework targeting transient hardware failures (i.e., hardware failures that impose temporary effects, e.g., soft errors, dynamic variations) in a DL inference accelerator, and (2) the first in-depth study, using this framework, of the impact of transient hardware failures on all hardware units of a DL inference accelerator through fault injection experiments. FIdelity enables accurate and fast resilience analysis by modeling hardware failures in software with high fidelity. This is achieved by taking into account both the spatial and temporal reuse of hardware signals to map the effects of a given faulty hardware signal to a set of faulty output neurons in the current DNN layer, whose corresponding faulty values can then be obtained through software simulation. Moreover, these accurate software fault models can be derived from minimal hardware information, without requiring access to the detailed hardware implementation (e.g., RTL), which may not be easily attainable. We implement and thoroughly validate the FIdelity framework using NVDLA as the baseline DL accelerator. Using FIdelity, we perform 46M fault injection experiments running various representative deep neural network inference workloads. We thoroughly analyze the experimental results, develop a fundamental understanding of how the injected faults propagate, and obtain several new insights that can be used to design efficient, resilient DL inference accelerators.

Lastly, we focus on DL training workloads and provide (1) the first study to establish a fundamental understanding of how both transient and permanent hardware failures affect DL training workloads, and (2) new, lightweight hardware failure mitigation techniques. We extend the FIdelity framework to perform large-scale fault injection experiments targeting DL training workloads, and conduct >2.9M experiments using a diverse set of DL training workloads. We characterize the outcomes of these experiments, thoroughly analyze the fault propagation paths, and derive the necessary conditions that must be satisfied for hardware failures to eventually cause unexpected training outcomes. Based on these necessary conditions, we develop ultra-lightweight software techniques to detect hardware failures and recover the workloads; they require only 24-32 lines of code change and introduce 0.003-0.025% performance overhead for various representative neural networks, which is 25x-500x more efficient than checkpointing techniques.
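The first contribution turns manufacturing-grade test patterns into ordinary DNN workloads. As a rough, hypothetical illustration (the function names and the diagonal packing are ours; the dissertation's actual mapping reverse-engineers NVDLA's dataflow/reuse algorithm), ATPG-style operand pairs can be packed into the weights and inputs of a dense layer so that executing the layer applies each pattern to one MAC lane, and any mismatch against a fault-free software reference flags a permanent fault:

# Hypothetical sketch: pack ATPG-style (weight, input) operand pairs into a
# dense layer so that running the layer exercises each MAC lane with its
# intended test stimulus. Names and packing scheme are illustrative.
import numpy as np

def build_test_layer(patterns):
    """patterns: list of (weight_operand, input_operand) pairs, one per lane."""
    n = len(patterns)
    W = np.zeros((n, n), dtype=np.int32)
    x = np.zeros(n, dtype=np.int32)
    for i, (w_op, x_op) in enumerate(patterns):
        W[i, i] = w_op  # diagonal weight isolates one MAC per output neuron
        x[i] = x_op
    return W, x

def run_test(accelerator_matmul, patterns):
    W, x = build_test_layer(patterns)
    golden = W @ x                       # fault-free software reference
    observed = accelerator_matmul(W, x)  # would execute on the device under test
    return np.nonzero(observed != golden)[0]  # lanes with mismatched responses

# Usage, with a software stand-in for the accelerator:
patterns = [(0x55, 0x33), (-128, 127), (0x0F, 0x70)]
print(run_test(lambda W, x: W @ x, patterns))  # [] -> no permanent fault detected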
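FIdelity's central idea, per the abstract, is to map a faulty hardware signal to the set of output neurons that reused it within the current layer, and then corrupt exactly those values in software. A minimal PyTorch rendering of the corruption step might look like the sketch below; the neuron indices, the bit position, and the hook mechanism are placeholders of ours, not the framework's actual interface, since FIdelity derives the affected set from the accelerator's spatial and temporal reuse:

# Hypothetical sketch: model one transient hardware fault as a bit flip in
# the output neurons that would have consumed the faulty signal.
import torch

def flip_bit(t, idx, bit):
    # Flip one bit of the float32 values at flat indices `idx` (in place).
    flat = t.view(-1)
    corrupted = flat[idx].view(torch.int32) ^ (1 << bit)
    flat[idx] = corrupted.view(torch.float32)
    return t

def make_injection_hook(neuron_idx, bit):
    # Forward hook that corrupts the layer's output for one inference.
    def hook(module, inputs, output):
        return flip_bit(output.clone(), neuron_idx, bit)
    return hook

# Usage: corrupt three (placeholder) output neurons of one layer.
layer = torch.nn.Conv2d(3, 8, 3)
handle = layer.register_forward_hook(
    make_injection_hook(neuron_idx=torch.tensor([0, 5, 9]), bit=23))
y_faulty = layer(torch.randn(1, 3, 16, 16))  # one faulty inference
handle.remove()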
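For training, the abstract's mitigation strategy hinges on detecting symptoms (such as the INF/NaN errors mentioned above) before a faulty update is committed, so recovery only has to re-execute one iteration instead of restoring a checkpoint. A guard in that spirit, assuming a standard PyTorch loop (this is our illustration; the dissertation's 24-32-line techniques check conditions derived from its fault-propagation analysis), could be:

# Hypothetical sketch: validate loss and gradients after backward(), and
# commit the optimizer step only if everything is finite; otherwise retry.
import torch

def step_is_healthy(loss, model):
    if not torch.isfinite(loss):
        return False  # INF/NaN loss: a symptom the abstract ties to faults
    return all(p.grad is None or torch.isfinite(p.grad).all()
               for p in model.parameters())

def guarded_step(model, optimizer, loss_fn, x, y, max_retries=1):
    for _ in range(max_retries + 1):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        if step_is_healthy(loss, model):
            optimizer.step()  # commit only a validated update
            return loss.item()
    raise RuntimeError("fault persisted across retries; escalate recovery")

Re-executing one mini-batch on a transient symptom is far cheaper than rolling back to a periodic checkpoint, which matches the intuition behind the 25x-500x efficiency gap the abstract reports.

MARC record: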
LDR
:06322nmm a2200421 4500
001
2393022
005
20231130111618.5
006
m o d
007
cr#unu||||||||
008
251215s2023 ||||||||||||||||| ||eng d
020
$a
9798379704513
035
$a
(MiAaPQ)AAI30486723
035
$a
AAI30486723
040
$a
MiAaPQ
$c
MiAaPQ
100
1
$a
He, Yi.
$3
1918543
245
1 0
$a
Resilient Deep Learning Accelerators.
260
1
$a
Ann Arbor :
$b
ProQuest Dissertations & Theses,
$c
2023
300
$a
151 p.
500
$a
Source: Dissertations Abstracts International, Volume: 84-12, Section: B.
500
$a
Advisor: Li, Yanjing.
502
$a
Thesis (Ph.D.)--The University of Chicago, 2023.
506
$a
This item must not be sold to any third party vendors.
590
$a
School code: 0330.
650
4
$a
Computer science.
$3
523869
650
4
$a
Computer engineering.
$3
621879
650
4
$a
Information technology.
$3
532993
653
$a
Computer architecture
653
$a
Deep learning accelerator
653
$a
Deep neural networks
653
$a
Hardware reliability
653
$a
Hardware resilience
653
$a
FIdelity
653
$a
DL training workloads
690
$a
0984
690
$a
0489
690
$a
0464
710
2
$a
The University of Chicago.
$b
Computer Science.
$3
1674744
773
0
$t
Dissertations Abstracts International
$g
84-12B.
790
$a
0330
791
$a
Ph.D.
792
$a
2023
793
$a
English
856
4 0
$u
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30486723
Holdings:
Barcode: W9501342
Location: Electronic resources
Circulation category: 11.線上閱覽_V (online viewing)
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Attachments: 0