National Dong Hwa University Library

Feature selection with missing data.

Sarkar, Saurabh.

Record Type:	Language materials, printed : Monograph/item
Title/Author:	Feature selection with missing data./
Author:	Sarkar, Saurabh.
Description:	107 p.
Notes:	Source: Dissertation Abstracts International, Volume: 75-02(E), Section: B.
Contained By:	Dissertation Abstracts International 75-02B(E).
Subject:	Engineering, General.
Online resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3601418
ISBN:	9781303525742

Thesis (Ph.D.)--University of Cincinnati, 2013.

In the modern world, information has become the new power. A growing amount of effort goes into gathering data: resources are allocated, time is invested, and tools are developed. Data collection is now a reality; however, it remains a great challenge to create value from the enormous volume of data being collected. Data modeling is one of the ways in which data is put to use. When modeling a process or a system, it is crucial to have the right features, and feature selection has therefore become an essential part of data modeling. Yet data are often missing, and in the worse case the important features themselves may have considerable data missing. The challenge is to pick out the best features while still accommodating the missing data. To address this problem, this dissertation introduces a cluster-based feature selection process that is robust in handling missing data. The research extends the Minimum Expected Cost of Misclassification (MECM) based feature selection method to very high-dimensional datasets by using cluster-based sampling methods. Although cluster-based sampling allows the MECM to scale to larger datasets, determining the optimal cluster size is still a challenge. This is the first issue that the dissertation aims to solve. The second is the handling of missing data during MECM-based feature selection. This area has not been studied as extensively as feature selection itself, even though missing data is encountered often. The dissertation discusses an algorithm that enables the MECM to handle missing data. The approach is probabilistic and is based on the distribution of the most similar instances: the algorithm determines the probability that an instance lies in the sampling cluster and then uses a fractional count when evaluating the MECM.
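The fractional-count idea can be illustrated with a small sketch. The abstract does not give the dissertation's formulas, so everything here is an assumption made for illustration: the sampling cluster is modeled as a ball (`center`, `radius`), similarity is Euclidean distance over the observed features, the membership probability is the share of the `k` most similar complete instances that lie in the ball, and `min_expected_cost` is a generic minimum-expected-cost rule rather than the author's MECM. All function names are hypothetical.

```python
import numpy as np

def membership_weights(X, center, radius, k=3):
    """Return a weight in [0, 1] per row: the estimated chance the row lies in the cluster.

    Assumes at least one row of X has no missing values.
    """
    X = np.asarray(X, dtype=float)
    center = np.asarray(center, dtype=float)
    missing = np.isnan(X).any(axis=1)
    complete = X[~missing]
    # Complete rows are simply inside (1.0) or outside (0.0) the ball.
    inside = (np.linalg.norm(complete - center, axis=1) <= radius).astype(float)
    weights = np.empty(len(X))
    weights[~missing] = inside
    for i in np.flatnonzero(missing):
        observed = ~np.isnan(X[i])
        # Compare with complete rows using only the features observed in row i.
        dist = np.linalg.norm(complete[:, observed] - X[i, observed], axis=1)
        nearest = np.argsort(dist, kind="stable")[:k]
        # Share of the k most similar complete rows that fall inside the cluster.
        weights[i] = inside[nearest].mean()
    return weights

def fractional_class_counts(weights, y, n_classes):
    """Sum the membership weights per class instead of counting whole instances."""
    return np.bincount(np.asarray(y), weights=weights, minlength=n_classes)

def min_expected_cost(counts, cost):
    """cost[i][j] is the cost of predicting class j when the true class is i."""
    return float(np.min(np.asarray(counts) @ np.asarray(cost, dtype=float)))

if __name__ == "__main__":
    X = [[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [4.0, 4.0], [2.0, np.nan]]
    y = [0, 0, 1, 1, 1]
    w = membership_weights(X, center=[0.0, 0.0], radius=1.0, k=2)
    counts = fractional_class_counts(w, y, n_classes=2)
    print(w, counts, min_expected_cost(counts, [[0, 1], [1, 0]]))
```

In this toy data the last row has one missing feature, so it contributes a fraction of an instance to its class count rather than being dropped or counted in full.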
One challenge of this approach is to correctly estimate the probability that a point with missing values lies within the sampling cluster. The key lies in choosing the right number of similar instances from which to calculate the probability, and the dissertation also seeks to address this problem. The last part of the research is a benchmark study to determine the effectiveness of the algorithm. Two benchmarks are used: a wrapper-based feature selection method using a naive Bayes classifier, and the MECM method without the missing-data algorithm. The MECM missing-data algorithm showed a significant improvement over both. Solving these problems is of great practical significance to data modeling. Instances with missing data might carry critical information; ignoring missing data during feature selection can have a cascading effect downstream when the final model is built. This research will make it possible to choose better features, which would in turn improve the accuracy of existing models. It will affect a broad range of applications, including gene-based medicine, fraud detection, engineering, business, and any field that uses feature selection as a component of the model-building process.
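For readers unfamiliar with the wrapper benchmark named above, the following is a minimal sketch of wrapper-based feature selection around a Gaussian naive Bayes classifier. It is not the dissertation's benchmark setup: the greedy forward search, the single hold-out split, the Gaussian likelihood, and the names `gaussian_nb_accuracy` and `forward_select` are all assumptions chosen to keep the example short, and the input is assumed to contain no missing values.

```python
import numpy as np

def gaussian_nb_accuracy(X_train, y_train, X_test, y_test):
    """Fit a Gaussian naive Bayes model and return its accuracy on the test split."""
    classes = np.unique(y_train)
    log_post = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9  # small floor avoids division by zero
        log_prior = np.log(len(Xc) / len(X_train))
        # Features are treated as independent, so per-feature log-likelihoods add up.
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_test - mean) ** 2 / var).sum(axis=1)
        log_post.append(log_prior + log_lik)
    pred = classes[np.argmax(np.array(log_post), axis=0)]
    return float((pred == y_test).mean())

def forward_select(X, y, n_features, train_frac=0.7, seed=0):
    """Greedy forward wrapper: add the feature that gives the best hold-out accuracy."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = order[:cut], order[cut:]
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        scores = [
            gaussian_nb_accuracy(
                X[train][:, selected + [j]], y[train],
                X[test][:, selected + [j]], y[test],
            )
            for j in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, 200)
    X = rng.normal(size=(200, 3))
    X[:, 1] += 4 * y  # only feature 1 carries class information
    print(forward_select(X, y, n_features=1))
```

A wrapper scores each candidate feature subset by training and evaluating the classifier itself, which is why its cost grows quickly with the number of features.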

ISBN: 9781303525742

Subjects--Topical Terms:
1020744 Engineering, General.

LDR:03980nam a2200277 4500
001 1966836
005 20141112075118.5
008 150210s2013 ||||||||||||||||| ||eng d
020 $a 9781303525742
035 $a (MiAaPQ)AAI3601418
035 $a AAI3601418
040 $a MiAaPQ $c MiAaPQ
100 1 $a Sarkar, Saurabh. $3 2103720
245 1 0 $a Feature selection with missing data.
300 $a 107 p.
500 $a Source: Dissertation Abstracts International, Volume: 75-02(E), Section: B.
500 $a Adviser: Hongdao Huang.
502 $a Thesis (Ph.D.)--University of Cincinnati, 2013.
590 $a School code: 0045.
650 4 $a Engineering, General. $3 1020744
650 4 $a Information Science. $3 1017528
690 $a 0537
690 $a 0723
710 2 $a University of Cincinnati. $b Industrial Engineering. $3 1683958
773 0 $t Dissertation Abstracts International $g 75-02B(E).
790 $a 0045
791 $a Ph.D.
792 $a 2013
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3601418