Feature selection with missing data.
Sarkar, Saurabh.
Record type: Bibliographic - language material, printed : Monograph/item
Title/Author: Feature selection with missing data.
Author: Sarkar, Saurabh.
Extent: 107 p.
Notes: Source: Dissertation Abstracts International, Volume: 75-02(E), Section: B.
Contained By: Dissertation Abstracts International 75-02B(E).
Subject: Engineering, General.
Electronic resource: http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3601418
ISBN: 9781303525742
Thesis (Ph.D.)--University of Cincinnati, 2013.
In the modern world, information has become the new power. Increasing effort goes into gathering data: resources are allocated, time is invested, and tools are developed. Data collection is now routine; however, it remains a great challenge to create value from the enormous volume of data being collected. Data modeling is one of the ways in which data is put to use. When modeling a process or a system, it is crucial to have the right features, so feature selection has become an essential part of data modeling. Yet data are often missing, and in the worst case the important features themselves have considerable data missing. The challenge is to pick the best features while still accommodating the missing data. To address this problem, this dissertation introduces a cluster-based feature selection process that handles missing data robustly. The research extends the Minimum Expected Cost of Misclassification (MECM) based feature selection method to very high-dimensional datasets by using cluster-based sampling methods. Although cluster-based sampling allows the MECM to scale to larger datasets, determining the optimal cluster size remains a challenge. This is the first issue the dissertation aims to solve. The second is handling missing data during MECM-based feature selection. This area has not been studied as extensively as feature selection itself, even though missing data is encountered often. The dissertation presents an algorithm that enables the MECM to handle missing data. The approach is probabilistic and is based on the distribution of the most similar instances: the algorithm estimates the probability that an instance with missing data lies in the sampling cluster and then uses a fractional count when evaluating the MECM.
One challenge of this approach is to correctly estimate the probability that a point with missing data lies within the sampling cluster. The key is choosing the right number of similar instances from which to calculate the probability, a problem the dissertation also addresses. The last part of the research is a benchmark study of the algorithm's effectiveness. Two benchmarks are used: a wrapper-based feature selection method using a Naive Bayesian classifier, and the MECM method without the missing-data algorithm. The MECM missing-data algorithm showed a significant improvement over both. Solving these problems is of great practical significance to data modeling. Instances with missing data may carry critical information, and ignoring missing data during feature selection can have a cascading effect downstream when the final model is built. This research makes it possible to choose better features, which in turn improves the accuracy of existing models. It applies to a broad range of fields, including gene-based medicine, fraud detection, engineering, business, and any other field that uses feature selection as a component of the model-building process.
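The fractional-count idea in the abstract can be illustrated with a small sketch. This is not the dissertation's algorithm: the record gives no implementation details, so the ball-shaped cluster, the Euclidean distance over observed features, and every function name below are assumptions made for illustration only. An instance with a missing value contributes to its class count the fraction of its k most similar complete instances that fall inside the cluster.

```python
import math

def distance(a, b):
    """Euclidean distance over the features observed (not None) in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def in_cluster(row, center, radius):
    """Assumed cluster shape: a ball of the given radius around center."""
    return distance(row, center) <= radius

def membership_probability(row, complete_rows, center, radius, k):
    """Estimate P(row lies in the cluster).

    Complete rows are tested directly. A row with missing values takes the
    fraction of its k most similar complete rows that lie inside the cluster.
    """
    if all(v is not None for v in row):
        return 1.0 if in_cluster(row, center, radius) else 0.0
    neighbours = sorted(complete_rows, key=lambda r: distance(row, r))[:k]
    if not neighbours:
        return 0.0
    inside = sum(in_cluster(r, center, radius) for r in neighbours)
    return inside / len(neighbours)

def fractional_class_counts(rows, labels, center, radius, k):
    """Per-class counts inside the cluster, with fractional contributions."""
    complete = [r for r in rows if all(v is not None for v in r)]
    counts = {}
    for row, label in zip(rows, labels):
        p = membership_probability(row, complete, center, radius, k)
        counts[label] = counts.get(label, 0.0) + p
    return counts

# Toy data (made up): the last row is missing its second feature.
rows = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (2.0, 2.0), (2.1, 1.9), (0.1, None)]
labels = ["a", "a", "b", "b", "b", "a"]
counts = fractional_class_counts(rows, labels, center=(0.0, 0.0), radius=0.5, k=4)
print(counts)
```

In this toy run, three of the four nearest complete rows lie inside the cluster, so the incomplete row adds 0.75 to class "a". The choice of k matters, which is the estimation problem the abstract mentions: with k=3 the same row would count as a full member. How such counts enter the MECM evaluation is not described in the record and is not shown here.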
ISBN: 9781303525742
Subjects--Topical Terms:
1020744
Engineering, General.
LDR :03980nam a2200277 4500
001 1966836
005 20141112075118.5
008 150210s2013 ||||||||||||||||| ||eng d
020 $a 9781303525742
035 $a (MiAaPQ)AAI3601418
035 $a AAI3601418
040 $a MiAaPQ $c MiAaPQ
100 1 $a Sarkar, Saurabh. $3 2103720
245 1 0 $a Feature selection with missing data.
300 $a 107 p.
500 $a Source: Dissertation Abstracts International, Volume: 75-02(E), Section: B.
500 $a Adviser: Hongdao Huang.
502 $a Thesis (Ph.D.)--University of Cincinnati, 2013.
590 $a School code: 0045.
650 4 $a Engineering, General. $3 1020744
650 4 $a Information Science. $3 1017528
690 $a 0537
690 $a 0723
710 2 $a University of Cincinnati. $b Industrial Engineering. $3 1683958
773 0 $t Dissertation Abstracts International $g 75-02B(E).
790 $a 0045
791 $a Ph.D.
792 $a 2013
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3601418
Holdings (1 item)
Barcode: W9261842
Location: Electronic resources
Circulation category: 11. Online viewing (線上閱覽_V)
Material type: E-book
Call number: EB
Usage type: Normal
Loan status: On shelf
Reservation status: 0