National Dong Hwa University Library

Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems.

Record type:	Bibliographic record, electronic resource : Monograph/item
Title proper/Author:	Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems.
Author:	Gong, Xuemei.
Extent:	118 p.
Notes:	Source: Dissertation Abstracts International, Volume: 77-05(E), Section: A.
Contained By:	Dissertation Abstracts International 77-05A(E).
Subject:	Information science.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3742950
ISBN:	9781339355979

Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems.
Gong, Xuemei.

Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems. - 118 p.

Source: Dissertation Abstracts International, Volume: 77-05(E), Section: A.

Thesis (Ph.D.)--Drexel University, 2015.

Classic information retrieval (IR) systems rely on ranking algorithms to serve users with ordered lists of documents according to search queries. Sometimes, however, users do not have very specific information needs or cannot accurately articulate their information needs in queries. Cluster-based IR systems, such as those based on the Scatter/Gather paradigm, have been used to help users clarify their information needs and promote learning via interactive document clustering and summarization. These systems have the potential to help users browse large document collections and explore topics. However, their effectiveness is often constrained by poor clustering quality, ambiguous cluster labels, and inefficiency in processing large-scale data sets.

ISBN: 9781339355979

Subjects--Topical Terms:

554358
Information science.

Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems.
LDR:04469nmm a2200313 4500
001 2069123
005 20160507120524.5
008 170521s2015 ||||||||||||||||| ||eng d
020 $a 9781339355979
035 $a (MiAaPQ)AAI3742950
035 $a AAI3742950
040 $a MiAaPQ $c MiAaPQ
100 1 $a Gong, Xuemei. $3 1279964
245 1 0 $a Enhancing Clustering and Labeling for Large-Scale Information Retrieval Systems.
300 $a 118 p.
500 $a Source: Dissertation Abstracts International, Volume: 77-05(E), Section: A.
500 $a Adviser: Weimao Ke.
502 $a Thesis (Ph.D.)--Drexel University, 2015.
520 $a Classic information retrieval (IR) systems rely on ranking algorithms to serve users with ordered lists of documents according to search queries. Sometimes, however, users do not have very specific information needs or cannot accurately articulate their information needs in queries. Cluster-based IR systems, such as those based on the Scatter/Gather paradigm, have been used to help users clarify their information needs and promote learning via interactive document clustering and summarization. These systems have the potential to help users browse large document collections and explore topics. However, their effectiveness is often constrained by poor clustering quality, ambiguous cluster labels, and inefficiency in processing large-scale data sets.
520 $a In interactive clustering, term distributions vary across clusters or subsets of a collection. Classic TF*IDF (term frequency * inverse document frequency) term weighting, especially IDF, which counts document frequency in the overall (global) data, does not take into account the shifted term distributions in a (local) subset and is often incapable of identifying the most informative terms within that subset. To improve clustering quality with meaningful labels, we propose two novel term weighting schemes, namely TF*ICDF and DF*LIG. TF*ICDF, or Term Frequency * Inverse within-Cluster Document Frequency, integrates the local subset information into term weighting. It outperforms TF*IDF in several aspects for clustering and labeling with various configurations.
520 $a In addition, we propose Least Information Gain (LIG) based on the least information theory, which, similar to Information Gain (IG) based on KL divergence, measures the amount of information required for a probability distribution change. Based on LIG, we develop the DF*LIG method for cluster labeling. With DF*LIG, terms that carry more information in revealing the contents of clusters are chosen as labels, resulting in better performance in terms of coverage, overlap and precision in comparison to DF*IG. By integrating TF*ICDF for term weighting and clustering, DF*LIG produces more representative, distinctive and accurate labels than when it is combined with TF*IDF.
520 $a In order to improve clustering efficiency and support data-intensive processing, we develop distributed versions of the TF*ICDF and DF*LIG algorithms as well as a parallel clustering algorithm named Pruned Affinity Propagation (PAP) in the Spark framework. The proposed algorithms efficiently process large-scale data sets by taking advantage of the computational capabilities of individual processors and nodes. Distributed TF*ICDF and DF*LIG methods scale very well: their efficiency improves significantly with an increased number of processors. Compared with the original affinity propagation algorithm, PAP achieves much higher efficiency while maintaining strong effectiveness. Results also show that the execution time of PAP is greatly reduced by increasing the number of processors and remains competitive with large numbers of documents, indicating its scalability.
520 $a With the support of these effective and scalable methods for text clustering and cluster labeling, a cluster-based IR system can be greatly improved in its ability to dynamically identify key features, to produce meaningful clusters, and to generate representative terms as labels. With the ability to accommodate large-scale data sets, such a system can help users discover important patterns in the data and help them learn and explore in a dynamic, complex information space.
590 $a School code: 0065.
650 4 $a Information science. $3 554358
690 $a 0723
710 2 $a Drexel University. $b Information Studies (College of Computing and Informatics). $3 2104098
773 0 $t Dissertation Abstracts International $g 77-05A(E).
790 $a 0065
791 $a Ph.D.
792 $a 2015
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3742950