Dong Hwa University Library

Fault-tolerant cluster management.

Li, Ming.

Record type:	Bibliographic - language material, print : Monograph/item
Title/Author:	Fault-tolerant cluster management.
Author:	Li, Ming.
Extent:	208 p.
Note:	Adviser: Yuval Tamir.
Contained By:	Dissertation Abstracts International 67-07B.
Subject:	Computer Science.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3226042
ISBN:	9780542796708

Fault-tolerant cluster management. - 208 p.

Thesis (Ph.D.)--University of California, Los Angeles, 2006.

Cost-effective high performance can be achieved using clusters of Commercial Off-The-Shelf (COTS) computers interconnected by high-speed networks. When clusters are used for critical applications and/or in hostile environments, the required system reliability can only be achieved using fault tolerance techniques that allow the system to continue to operate correctly despite component failures. Cluster management middleware (CMM) is a software layer that sits above the operating system controlling individual nodes and below the applications. The CMM schedules tasks on a cluster, controls access to shared resources, provides for task submission and monitoring, and coordinates the cluster's fault tolerance mechanisms. Reliable operation of the cluster requires reliable, continuous operation of the management middleware.

LDR:03185nam 2200289 a 45
001 974322
005 20110929
008 110929s2006 eng d
020 $a 9780542796708
035 $a (UMI)AAI3226042
035 $a AAI3226042
040 $a UMI $c UMI
100 1 $a Li, Ming. $3 559294
245 1 0 $a Fault-tolerant cluster management.
300 $a 208 p.
500 $a Adviser: Yuval Tamir.
500 $a Source: Dissertation Abstracts International, Volume: 67-07, Section: B, page: 3906.
502 $a Thesis (Ph.D.)--University of California, Los Angeles, 2006.
520 $a Cost-effective high performance can be achieved using clusters of Commercial Off-The-Shelf (COTS) computers interconnected by high-speed networks. When clusters are used for critical applications and/or in hostile environments, the required system reliability can only be achieved using fault tolerance techniques that allow the system to continue to operate correctly despite component failures. Cluster management middleware (CMM) is a software layer that sits above the operating system controlling individual nodes and below the applications. The CMM schedules tasks on a cluster, controls access to shared resources, provides for task submission and monitoring, and coordinates the cluster's fault tolerance mechanisms. Reliable operation of the cluster requires reliable, continuous operation of the management middleware.
520 $a This dissertation focuses on the key challenges in building a highly reliable CMM. The system is based on centralized decision making. However, unlike most other cluster middleware, the manager is protected by Byzantine fault-tolerant state machine replication and is able to restore the management service to full functionality and full fault tolerance following arbitrary single faults. To this end, we use a low-cost fault-tolerant replication mechanism coupled with on-line self-diagnosis and reconfiguration. The robust replicated manager is coupled with less aggressive fault tolerance mechanisms for dealing with less critical system components and with a fault-tolerant system bootstrapping mechanism. A fault-tolerant cluster designed to operate autonomously must include a highly reliable trusted hardcore to control critical functions such as the initiation of a node reset. We describe the functionality required from this trusted hardcore and its interactions with the replicated cluster manager.
520 $a The result of this work is a carefully balanced, integrated set of efficient, practical techniques for aggressive fault tolerance. These techniques allow a highly reliable system to be built using mostly standard COTS hardware and software components. This is demonstrated in an operational system, called Ghidrah, which has been built at UCLA. This dissertation includes a preliminary performance evaluation of Ghidrah and validation of the fault tolerance mechanisms by fault injection experiments.
590 $a School code: 0031.
650 4 $a Computer Science. $3 626642
690 $a 0984
710 2 0 $a University of California, Los Angeles. $3 626622
773 0 $t Dissertation Abstracts International $g 67-07B.
790 $a 0031
790 1 0 $a Tamir, Yuval, $e advisor
791 $a Ph.D.
792 $a 2006
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3226042
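The abstract states that the central manager is protected by Byzantine fault-tolerant state machine replication. As a minimal sketch of that general idea only (not the Ghidrah implementation, whose low-cost replication scheme may use a different replica count and protocol), the Python below assumes commands already arrive at every replica in one agreed order, uses the classic 3f + 1 replica bound from PBFT-style protocols, and accepts a reply once f + 1 replicas agree. The `Replica` class, its task table, and the `assign`/`query` commands are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def replicas_needed(f: int) -> int:
    """Replicas required by classic (PBFT-style) Byzantine SMR to tolerate f faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def vote(replies: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Return a reply vouched for by at least f + 1 replicas, else None.

    With at most f Byzantine replicas, f + 1 matching replies guarantee
    that at least one correct replica produced the value.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


class Replica:
    """Hypothetical deterministic state machine: a task-to-node table."""

    def __init__(self, faulty: bool = False) -> None:
        self.tasks: dict[str, str] = {}
        self.faulty = faulty  # a faulty replica returns arbitrary output

    def apply(self, command: tuple[str, str, str]) -> str:
        op, task, node = command
        if op == "assign":
            self.tasks[task] = node
        result = f"{task}@{self.tasks.get(task, 'unassigned')}"
        return "corrupted" if self.faulty else result


if __name__ == "__main__":
    f = 1
    # Three correct replicas plus one Byzantine replica (4 = 3f + 1).
    group = [Replica() for _ in range(replicas_needed(f) - 1)] + [Replica(faulty=True)]
    log = [("assign", "job-1", "node-3"), ("query", "job-1", "")]
    for cmd in log:
        replies = [r.apply(cmd) for r in group]
        print(cmd[0], vote(replies, f))
```

Running it prints `assign job-1@node-3` and `query job-1@node-3`: the faulty replica's output is outvoted. The sketch deliberately leaves out the ordering (agreement) protocol, view changes, and the on-line self-diagnosis and reconfiguration the abstract describes.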