Multi-Armed Bandits in Large-Scale Complex Systems.

Record type: Bibliographic - Electronic resource : Monograph/item
Title/Author: Multi-Armed Bandits in Large-Scale Complex Systems. / Xu, Xiao.
Author: Xu, Xiao.
Publisher: Ann Arbor : ProQuest Dissertations & Theses, 2020
Pagination: 176 p.
Notes: Source: Dissertations Abstracts International, Volume: 81-12, Section: B.
Contained by: Dissertations Abstracts International, 81-12B.
Subject: Electrical engineering.
Electronic resource: http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27962096
ISBN: 9781083470713
LDR  05500nmm a2200349 4500
001  2270143
005  20200921070635.5
008  220629s2020 ||||||||||||||||| ||eng d
020  $a 9781083470713
035  $a (MiAaPQ)AAI27962096
035  $a AAI27962096
040  $a MiAaPQ $c MiAaPQ
100  1  $a Xu, Xiao. $3 1262092
245  1 0  $a Multi-Armed Bandits in Large-Scale Complex Systems.
260  1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2020
300  $a 176 p.
500  $a Source: Dissertations Abstracts International, Volume: 81-12, Section: B.
500  $a Advisor: Zhao, Qing.
502  $a Thesis (Ph.D.)--Cornell University, 2020.
506  $a This item must not be sold to any third party vendors.
520  $a This dissertation focuses on the multi-armed bandit (MAB) problem, where the objective is to design a sequential arm selection policy that maximizes the total reward over time. Canonical formulations of MAB adopt the following assumptions: the size of the action space is much smaller than the length of the time horizon, computation resources such as memory are unlimited during the learning process, and the generative models of arm rewards are time-invariant. This dissertation aims to relax these assumptions, which are unrealistic in emerging applications involving large-scale complex systems, and to develop corresponding techniques that address the resulting new issues.

The first part of the dissertation addresses the issue of a massive number of actions. A stochastic bandit problem with side information on arm similarity and dissimilarity is studied. The main results include a unit interval graph (UIG) representation of the action space that succinctly models the side information, and a two-step learning structure that fully exploits the topological structure of the UIG to achieve an optimal scaling of the learning cost with the size of the action space. Specifically, in the UIG representation, each node represents an arm, and the presence (absence) of an edge between two nodes indicates similarity (dissimilarity) between their mean rewards. Based on whether the UIG is fully revealed by the side information, two settings with complete and partial side information are considered. For each setting, a two-step learning policy consisting of an offline reduction of the action space and an online aggregation of reward observations from similar arms is developed. The computational efficiency and the order optimality of the proposed strategies, in terms of the size of the action space and the time length, are established. Numerical experiments on both synthetic and real-world datasets verify the performance of the proposed policies in practice.

The second part of the dissertation studies the issue of limited memory during the learning process in the adversarial bandit setting. Specifically, a learning policy can only store the statistics of a subset of arms summarizing their reward history. A general hierarchical learning structure that trades off the regret order with memory complexity is developed, based on multi-level partitions of the arm set into groups and of the time horizon into epochs. The proposed learning policy requires only a sublinear order of memory space in terms of the number of arms. Its sublinear regret orders with respect to the time horizon are established for both weak regret and shifting regret, in expectation and/or with high probability, when appropriate learning strategies are adopted as subroutines at all levels. By properly choosing the number of levels in the adopted hierarchy, the policy adapts to different sizes of the available memory space. A memory-dependent regret bound is established to characterize the tradeoff between memory complexity and the regret performance of the policy. Numerical examples verify the performance of the policy.

The third part of the dissertation focuses on the issue of time-varying rewards within the contextual bandit framework, which finds applications in various online recommendation systems. The main results include two reward models characterizing the fact that the preferences of users toward different items change asynchronously and distinctly, and a learning algorithm that adapts to the dynamic environment. In particular, the two models assume disjoint and hybrid rewards. In the disjoint setting, the mean reward of playing an arm is determined by an arm-specific preference vector, which is piecewise stationary with asynchronous change times across arms. In the hybrid setting, the mean reward of an arm also depends on a joint coefficient vector shared by all arms, representing the time-invariant component of user interests, in addition to the arm-specific one that is time-varying. Two algorithms based on change detection and restarts are developed for the two settings, and their performance is verified through simulations on both synthetic and real-world data. A theoretical regret analysis of the algorithm, with certain modifications, is provided under the disjoint reward model, showing that a near-optimal regret order in the time length is achieved.
590  $a School code: 0058.
650  4  $a Electrical engineering. $3 649834
650  4  $a Computer science. $3 523869
650  4  $a Operations research. $3 547123
653  $a Large-Scale Complex Systems
653  $a Multi-Armed Bandits
653  $a No-regret Learning
690  $a 0544
690  $a 0984
690  $a 0796
710  2  $a Cornell University. $b Electrical and Computer Engineering. $3 2100661
773  0  $t Dissertations Abstracts International $g 81-12B.
790  $a 0058
791  $a Ph.D.
792  $a 2020
793  $a English
856  4 0  $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27962096
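The third part of the abstract (MARC field 520 above) describes contextual bandit algorithms that pair arm-specific, piecewise-stationary preference vectors with change detection and restarts. The record does not spell out the dissertation's own algorithms, so the following Python sketch is only a generic illustration of that idea under the disjoint reward model: a LinUCB-style per-arm ridge estimator combined with a crude residual-based change test. The class name and parameters (alpha, window, threshold) are assumptions for illustration, not taken from the dissertation.

```python
import numpy as np


class DisjointChangeDetectLinUCB:
    """Illustrative sketch (not the dissertation's algorithm): a disjoint
    contextual bandit with per-arm change detection and restarts.

    Each arm keeps its own ridge-regression estimate of an arm-specific
    preference vector (the "disjoint" model). A crude test on recent
    prediction residuals restarts only that arm's statistics when its
    vector appears to have changed.
    """

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0,
                 window: int = 50, threshold: float = 3.0):
        self.n_arms, self.dim = n_arms, dim
        self.alpha, self.window, self.threshold = alpha, window, threshold
        self.A = [np.eye(dim) for _ in range(n_arms)]    # regularized Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sums
        self.residuals = [[] for _ in range(n_arms)]     # recent prediction errors per arm

    def _restart(self, k: int) -> None:
        # Discard arm k's history after a detected change; other arms are untouched.
        self.A[k] = np.eye(self.dim)
        self.b[k] = np.zeros(self.dim)
        self.residuals[k] = []

    def select(self, context: np.ndarray) -> int:
        # LinUCB-style optimistic index: estimated reward plus exploration bonus.
        scores = np.empty(self.n_arms)
        for k in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[k])
            theta = A_inv @ self.b[k]
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)
            scores[k] = context @ theta + bonus
        return int(np.argmax(scores))

    def update(self, k: int, context: np.ndarray, reward: float) -> None:
        # Record the prediction residual, then update arm k's statistics.
        A_inv = np.linalg.inv(self.A[k])
        residual = reward - context @ (A_inv @ self.b[k])
        self.residuals[k] = (self.residuals[k] + [residual])[-self.window:]

        self.A[k] += np.outer(context, context)
        self.b[k] += reward * context

        # Crude change test: the mean residual drifts significantly away from zero.
        res = np.asarray(self.residuals[k])
        if len(res) == self.window and abs(res.mean()) > self.threshold * (res.std() + 1e-8) / np.sqrt(self.window):
            self._restart(k)
```

A caller alternates select(context) and update(arm, context, reward); when the test fires, only the affected arm's statistics are discarded, mirroring the asynchronous, arm-specific changes the abstract describes.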
Holdings (1 item)

Barcode: W9422377
Location: Electronic resources
Circulation category: 11. Online reading_V
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Holds: 0