Temporal Learning for Video-Language Understanding and Generation.
Record Type:
Electronic resources : Monograph/item
Title/Author:
Temporal Learning for Video-Language Understanding and Generation.
Author:
Zhang, Songyang.
Published:
Ann Arbor : ProQuest Dissertations & Theses, 2023.
Description:
249 p.
Notes:
Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
Contained By:
Dissertations Abstracts International, 85-03A.
Subject:
Computer science.
Online resource:
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30572454
ISBN:
9798380318648
Dissertation Note:
Thesis (Ph.D.)--University of Rochester, 2023.
Abstract:
Vision-language studies jointly perceive, understand, and generate across the vision and language modalities to perform various tasks, such as retrieving an image or video given a sentence, or generating an image or video from a sentence. Great success has been achieved in image-language studies; however, video-language studies still lag behind. Unlike image studies, which mainly focus on static objects or scenes, the core challenge in video studies is how to further learn dynamic changes. Time is an intrinsic attribute of both video and language. How to encode time has been studied in both the CV and NLP communities, but the alignment and interaction between the two modalities had rarely been studied until recently.

In this thesis, we study temporal learning for video-language tasks from both the video and the language side. To enable interaction between different modalities, it is natural to ask how to learn the alignment between them. In the first part, we answer this question by studying a specific video-language alignment task, moment localization with natural language, which aims to retrieve a specific moment from an untrimmed video given a query sentence. We study this problem from two aspects: video context modeling and temporal language modeling. Given a good video-language alignment, a natural follow-up question is whether such knowledge can be leveraged to benefit conventional NLP tasks, or whether CV tasks can be learned with language as guidance.

In the second part, we first investigate a conventional NLP problem, grammar induction, which aims to find hierarchical syntactic structures in plain sentences. We find that leveraging the regularities between video and text improves the parser's performance. We further investigate the dataset limitations of this approach and propose a solution that leverages instructional videos without any human annotation effort.

In the third part, we study video generation with language. We investigate this problem from the perspectives of dataset collection, spatial-temporal modeling, and efficiency. We develop a dataset to enable focused advances on some of the core challenges of multimodal video research. We also leverage text-to-image models to learn the correspondence between text and the visual world, and use unsupervised learning on unlabeled (unpaired) video data to learn realistic motion. In addition, we propose a novel temporal shift module that leverages a T2I model as-is for T2V generation without adding any new parameters.

Building on these works, we present several exciting directions for future research.
Subjects--Topical Terms:
Computer science.
Subjects--Index Terms:
Temporal learning
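The abstract above mentions a parameter-free temporal shift module for reusing a text-to-image model for video generation. As a rough illustration of the general temporal-shift idea only, and not the dissertation's actual method, the following minimal Python sketch shifts part of the feature channels one frame forward and part one frame backward; the function name, the shift_fraction value, and the (T, C, H, W) layout are illustrative assumptions.

import numpy as np

def temporal_shift(features, shift_fraction=0.25):
    """Illustrative, parameter-free temporal shift over a (T, C, H, W) feature stack.

    A fraction of the channels is shifted one frame forward in time and an
    equal fraction one frame backward, so each frame mixes information from
    its neighbors without adding any learnable parameters.
    """
    t, c, h, w = features.shape
    fold = int(c * shift_fraction)
    out = np.zeros_like(features)
    out[1:, :fold] = features[:-1, :fold]                  # shift forward in time
    out[:-1, fold:2 * fold] = features[1:, fold:2 * fold]  # shift backward in time
    out[:, 2 * fold:] = features[:, 2 * fold:]             # leave the rest untouched
    return out

# Toy usage: 8 frames of 64-channel feature maps.
frames = np.random.randn(8, 64, 16, 16).astype(np.float32)
shifted = temporal_shift(frames)
print(shifted.shape)  # (8, 64, 16, 16)

Because the shift is pure re-indexing of existing features, it introduces no new parameters, which is the property the abstract highlights for using a T2I model as-is for T2V generation.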
LDR
03728nmm a2200385 4500
001
2399400
005
20240916065429.5
006
m o d
007
cr#unu||||||||
008
251215s2023 ||||||||||||||||| ||eng d
020
$a
9798380318648
035
$a
(MiAaPQ)AAI30572454
035
$a
AAI30572454
040
$a
MiAaPQ
$c
MiAaPQ
100
1
$a
Zhang, Songyang.
$3
3681695
245
1 0
$a
Temporal Learning for Video-Language Understanding and Generation.
260
1
$a
Ann Arbor :
$b
ProQuest Dissertations & Theses,
$c
2023
300
$a
249 p.
500
$a
Source: Dissertations Abstracts International, Volume: 85-03, Section: A.
500
$a
Advisor: Luo, Jiebo.
502
$a
Thesis (Ph.D.)--University of Rochester, 2023.
590
$a
School code: 0188.
650
4
$a
Computer science.
$3
523869
650
4
$a
Information technology.
$3
532993
650
4
$a
Multimedia communications.
$3
590562
653
$a
Temporal learning
653
$a
Vision-language studies
653
$a
Video-language alignment
653
$a
Video generation
653
$a
Natural language processing
690
$a
0984
690
$a
0489
690
$a
0558
710
2
$a
University of Rochester.
$b
Hajim School of Engineering and Applied Sciences.
$3
2099687
773
0
$t
Dissertations Abstracts International
$g
85-03A.
790
$a
0188
791
$a
Ph.D.
792
$a
2023
793
$a
English
856
4 0
$u
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30572454
Items
Inventory Number: W9507720
Location Name: Electronic Resources (電子資源)
Item Class: 11. Online Reading (11.線上閱覽_V)
Material type: E-book (電子書)
Call number: EB
Usage Class: Normal (一般使用)
Loan Status: On shelf
No. of reservations: 0