On the Statistical Complexity of Offline Policy Evaluation for Tabular Reinforcement Learning.
Record type:
Bibliographic - Electronic resource : Monograph/item
Title/Author:
On the Statistical Complexity of Offline Policy Evaluation for Tabular Reinforcement Learning./
Author:
Yin, Ming.
Extent:
1 online resource (178 pages)
Notes:
Source: Dissertations Abstracts International, Volume: 84-11, Section: B.
Contained By:
Dissertations Abstracts International, 84-11B.
Subject:
Statistics.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30313869 (click for full text, PQDT)
ISBN:
9798379496500
On the Statistical Complexity of Offline Policy Evaluation for Tabular Reinforcement Learning.
Yin, Ming.
On the Statistical Complexity of Offline Policy Evaluation for Tabular Reinforcement Learning.
- 1 online resource (178 pages)
Source: Dissertations Abstracts International, Volume: 84-11, Section: B.
Thesis (Ph.D.)--University of California, Santa Barbara, 2023.
Includes bibliographical references
Offline Policy Evaluation (OPE) aims at evaluating the expected cumulative reward of a target policy π when the offline data are collected by running a logging policy μ. Standard importance-sampling-based approaches to this problem suffer from a variance that scales exponentially with the time horizon H, which has motivated a surge of recent interest in alternatives that break the "Curse of Horizon". In the second chapter of this thesis, we prove that a modification of the Marginalized Importance Sampling (MIS) method can achieve the Cramér-Rao lower bound, provided that the state space and the action space are finite. In the third chapter, we go beyond the off-policy evaluation setting and propose a new uniform convergence framework for OPE. The uniform OPE problem requires evaluating all policies in a policy class Π simultaneously, and we obtain nearly optimal error bounds for a number of global/local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of O(H³/(d_m ε²)) for identifying an ε-optimal policy under the time-inhomogeneous episodic MDP model, where d_m is the minimal marginal state-action visitation probability of the MDP under the behavior policy μ. In the fourth chapter, we further improve the sample-complexity guarantee to O(H²/(d_m ε²)) for time-homogeneous episodic MDPs, using a novel singleton-absorbing MDP technique. Both results are known to be optimal in their respective settings. In the final part of the thesis, we summarise our work in reinforcement learning and conclude with potential future directions.
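The abstract's central object is marginalized importance sampling (MIS) for tabular OPE. As a rough illustration only, the Python sketch below shows one common way such an estimator can be instantiated in the finite-state, finite-action, finite-horizon setting the abstract describes: build an empirical transition and reward model from the logged episodes, then roll the target policy's state marginals forward through it. The function name tabular_mis_ope, its signature, and the exact recursion are assumptions made for this sketch; they are not taken from the thesis itself.

import numpy as np

def tabular_mis_ope(episodes, pi, n_states, n_actions, H):
    """Sketch of a tabular MIS-style OPE estimate of v^pi from logged data.

    episodes : list of length-H trajectories [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})]
    pi       : array of shape (H, n_states, n_actions) with target-policy probabilities pi_h(a | s)
    """
    n = len(episodes)
    counts = np.zeros((H, n_states, n_actions))            # visits to (h, s, a)
    trans  = np.zeros((H, n_states, n_actions, n_states))  # counts of s' following (h, s, a)
    rew    = np.zeros((H, n_states, n_actions))            # summed rewards at (h, s, a)

    for ep in episodes:
        for h, (s, a, r) in enumerate(ep):
            counts[h, s, a] += 1
            rew[h, s, a] += r
            if h + 1 < H:
                trans[h, s, a, ep[h + 1][0]] += 1

    # Empirical model: P_hat(s' | s, a) and r_hat(s, a) per step (zero where unvisited).
    safe = np.maximum(counts, 1)
    P_hat = trans / safe[..., None]
    r_hat = rew / safe

    # Empirical initial-state distribution, then propagate the target policy's marginals.
    d = np.zeros(n_states)
    for ep in episodes:
        d[ep[0][0]] += 1.0 / n

    value = 0.0
    for h in range(H):
        sa = d[:, None] * pi[h]                    # joint marginal d^pi_h(s) * pi_h(a | s)
        value += np.sum(sa * r_hat[h])             # accumulate expected reward at step h
        d = np.einsum('sa,sat->t', sa, P_hat[h])   # next-step state marginal under pi
    return value

Estimators of this marginal-distribution form avoid multiplying H per-step importance ratios together, which is the source of the exponential-in-H variance that, per the abstract, motivates breaking the "Curse of Horizon".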
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2023
Mode of access: World Wide Web
ISBN: 9798379496500
Subjects--Topical Terms: Statistics.
Subjects--Index Terms: Offline reinforcement learning
Index Terms--Genre/Form: Electronic books.
On the Statistical Complexity of Offline Policy Evaluation for Tabular Reinforcement Learning.
LDR :03203nmm a2200373K 4500
001 2356322
005 20230612072307.5
006 m o d
007 cr mn ---uuuuu
008 241011s2023 xx obm 000 0 eng d
020 $a 9798379496500
035 $a (MiAaPQ)AAI30313869
035 $a AAI30313869
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Yin, Ming. $3 1058981
245 1 0 $a On the Statistical Complexity of Offline Policy Evaluation for Tabular Reinforcement Learning.
264 0 $c 2023
300 $a 1 online resource (178 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertations Abstracts International, Volume: 84-11, Section: B.
500 $a Advisor: Jammalamadaka, S. Rao ; Wang, Yu-Xiang.
502 $a Thesis (Ph.D.)--University of California, Santa Barbara, 2023.
504 $a Includes bibliographical references
520 $a Offline Policy Evaluation (OPE) aims at evaluating the expected cumulative reward of a target policy π when the offline data are collected by running a logging policy μ. Standard importance-sampling-based approaches to this problem suffer from a variance that scales exponentially with the time horizon H, which has motivated a surge of recent interest in alternatives that break the "Curse of Horizon". In the second chapter of this thesis, we prove that a modification of the Marginalized Importance Sampling (MIS) method can achieve the Cramér-Rao lower bound, provided that the state space and the action space are finite. In the third chapter, we go beyond the off-policy evaluation setting and propose a new uniform convergence framework for OPE. The uniform OPE problem requires evaluating all policies in a policy class Π simultaneously, and we obtain nearly optimal error bounds for a number of global/local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of O(H³/(d_m ε²)) for identifying an ε-optimal policy under the time-inhomogeneous episodic MDP model, where d_m is the minimal marginal state-action visitation probability of the MDP under the behavior policy μ. In the fourth chapter, we further improve the sample-complexity guarantee to O(H²/(d_m ε²)) for time-homogeneous episodic MDPs, using a novel singleton-absorbing MDP technique. Both results are known to be optimal in their respective settings. In the final part of the thesis, we summarise our work in reinforcement learning and conclude with potential future directions.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2023
538 $a Mode of access: World Wide Web
650 4 $a Statistics. $3 517247
650 4 $a Applied mathematics. $3 2122814
653 $a Offline reinforcement learning
653 $a Statistical machine learning
653 $a Policy evaluation
653 $a Target policy
655 7 $a Electronic books. $2 lcsh $3 542853
690 $a 0463
690 $a 0364
710 2 $a ProQuest Information and Learning Co. $3 783688
710 2 $a University of California, Santa Barbara. $b Statistics and Applied Probability. $3 3170160
773 0 $t Dissertations Abstracts International $g 84-11B.
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30313869 $z click for full text (PQDT)
Holdings
1 record • Page 1
Barcode: W9478678
Location: Electronic resources
Circulation category: 11.線上閱覽_V (online reading)
Material type: E-book
Call number: EB
Usage type: General use (Normal)
Loan status: On shelf
Hold status: 0
Remarks:
Attachments: