Missing Data Imputation Using Machine Learning and Natural Language Processing for Clinical Diagnostic Codes.
Record type:
Bibliographic - Electronic resource : Monograph/item
Title/Author:
Missing Data Imputation Using Machine Learning and Natural Language Processing for Clinical Diagnostic Codes.
Author:
Choudhury, Arkopal.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2020
Pagination:
95 p.
Notes:
Source: Dissertations Abstracts International, Volume: 82-01, Section: B.
Contained by:
Dissertations Abstracts International, 82-01B.
Subject:
Biostatistics.
Electronic resource:
https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27959806
ISBN:
9798635254479
Thesis (Ph.D.)--The University of North Carolina at Chapel Hill, 2020.
This item must not be sold to any third party vendors.
Imputation of missing data is a common task in supervised classification problems, where the feature matrix of the training dataset has varying degrees of missingness. Most previous studies do not take the class label into account when imputing missing data for classification. A widely used solution is imputation based on the lazy-learning k-Nearest Neighbor (KNN) approach. We develop a variant of this imputation algorithm, the Class-weighted Gray's k-Nearest Neighbor (CGKNN) approach, which combines Gray's distance with Mutual Information (MI). Gray's distance works well with heterogeneous mixed-type data containing missing instances, and we weight the distance by the mutual information, a measure of feature relevance, between each feature and the class label. This method outperforms traditional methods on classification problems with mixed data, as shown in simulations and in applications to University of California, Irvine (UCI) Machine Learning Repository datasets (http://archive.ics.uci.edu/ml/index.php).

Loss to follow-up is a common problem in longitudinal data, especially in studies involving multiple visits over a long period of time. If the outcome of interest is observed at each time point despite covariates missing due to follow-up (for example, an outcome ascertained through phone calls), then random forest imputation is a good technique for the missing covariates. The missingness involves complicated interactions over time, since most of the covariates and the outcome are measured repeatedly. Random forests are a non-parametric learning technique that captures complex interactions among mixed-type data. We propose a proximity imputation and a missForest-type covariate imputation with random splits while building the forest. The performance of these imputation techniques is compared to existing techniques in various simulation settings.

The Atherosclerosis Risk in Communities (ARIC) Study is a longitudinal cohort study, started in 1987-1989, that collects data on participants across four US states to study the factors behind heart disease. We consider patients at the 5th visit (which occurred in 2013) who were enrolled in continuous Medicare Fee-For-Service (FFS) insurance during the 6 months prior to their visit, so that their hospitalization diagnostic codes are available. Our aim is to characterize the hospitalization of patients with ascertained cognitive status (classified as dementia, mild cognitive disorder, or no cognitive disorder) at the 5th visit. Diagnostic codes for inpatient and outpatient visits, identified from CMS (Centers for Medicare & Medicaid Services) Medicare FFS data linked with ARIC participant data, are stored as International Classification of Diseases (ICD) codes. We treat these codes as a bag-of-words model, applying text mining techniques to obtain meaningful clusters of ICD codes.
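The core CGKNN idea, weighting KNN distances by each feature's mutual information with the class label, can be illustrated with a minimal sketch. This is not the dissertation's implementation: it estimates MI by quantile binning, substitutes a weighted Euclidean distance for Gray's distance, and imputes with a plain neighbor mean rather than class weighting.

```python
import numpy as np

def mutual_information(x, y, bins=4):
    """Estimate MI between a continuous feature x (quantile-binned) and a discrete label y."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    xb = np.digitize(x, edges)
    joint = np.zeros((xb.max() + 1, int(y.max()) + 1))
    for a, b in zip(xb, y.astype(int)):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def mi_weighted_knn_impute(X, y, k=3):
    """Fill NaNs in X using k nearest fully observed rows, with
    per-feature distance weights proportional to MI with the label y."""
    X = X.copy()
    obs_rows = ~np.isnan(X).any(axis=1)            # fully observed donor rows
    w = np.array([mutual_information(X[obs_rows, j], y[obs_rows])
                  for j in range(X.shape[1])])
    w = w / w.sum() if w.sum() > 0 else np.full(X.shape[1], 1 / X.shape[1])
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        d = np.sqrt((((X[obs_rows][:, ~miss] - X[i, ~miss]) ** 2)
                     * w[~miss]).sum(axis=1))      # MI-weighted distance
        nn = X[obs_rows][np.argsort(d)[:k]]
        X[i, miss] = nn[:, miss].mean(axis=0)      # mean over k nearest donors
    return X

# Toy data: two clusters of complete rows, one row with a missing value.
X = np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [5.1, 6.1], [np.nan, 2.05]])
y = np.array([0, 0, 1, 1, 0])
X_imputed = mi_weighted_knn_impute(X, y, k=2)
```

Here the missing entry is filled from the two donors closest in the observed feature, so it lands near the first cluster; the full CGKNN method additionally handles categorical features through Gray's distance.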
Subjects--Topical Terms:
Biostatistics.
Subjects--Index Terms:
Clustering
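The bag-of-words treatment of ICD codes described in the abstract can be sketched as a TF-IDF matrix over per-patient code lists, clustered with k-means. The patients and codes below are hypothetical, and the tiny k-means with fixed initialization stands in for the dissertation's actual text-mining pipeline.

```python
import numpy as np

# Each "document" is one patient's list of ICD-10 codes (bag-of-words).
patients = [
    ["I10", "I25", "E11"],        # hypertension, ischemic heart disease, diabetes
    ["F03", "G30"],               # unspecified dementia, Alzheimer's disease
    ["I10", "I25", "E11"],
    ["F03", "G30", "R41"],        # + cognitive symptoms
]

vocab = sorted({c for p in patients for c in p})
idx = {c: j for j, c in enumerate(vocab)}

# Term-frequency matrix, then smoothed IDF weighting
tf = np.zeros((len(patients), len(vocab)))
for i, p in enumerate(patients):
    for c in p:
        tf[i, idx[c]] += 1
df = (tf > 0).sum(axis=0)
tfidf = tf * np.log((1 + len(patients)) / (1 + df))

def kmeans(X, k, iters=20):
    """Minimal k-means; initializes centers from the first k rows."""
    centers = X[:k].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

labels = kmeans(tfidf, k=2)
```

With this toy input the cardiovascular and cognitive-disorder patients fall into separate clusters, which is the kind of "meaningful cluster of ICD codes" the abstract refers to.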
LDR    04295nmm a2200397 4500
001    2276847
005    20210510092425.5
008    220723s2020 ||||||||||||||||| ||eng d
020    $a 9798635254479
035    $a (MiAaPQ)AAI27959806
035    $a AAI27959806
040    $a MiAaPQ $c MiAaPQ
100 1  $a Choudhury, Arkopal. $3 3555149
245 10 $a Missing Data Imputation Using Machine Learning and Natural Language Processing for Clinical Diagnostic Codes.
260 1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2020
300    $a 95 p.
500    $a Source: Dissertations Abstracts International, Volume: 82-01, Section: B.
500    $a Advisor: Kosorok, Michael R.
502    $a Thesis (Ph.D.)--The University of North Carolina at Chapel Hill, 2020.
506    $a This item must not be sold to any third party vendors.
520    $a [Abstract as given above.]
590    $a School code: 0153.
650  4 $a Biostatistics. $3 1002712
650  4 $a Statistics. $3 517247
650  4 $a Computer science. $3 523869
650  4 $a Artificial intelligence. $3 516317
653    $a Clustering
653    $a Machine learning
653    $a Missing data imputation
653    $a Nearest neighbors
653    $a Random forests
653    $a Text mining
690    $a 0308
690    $a 0463
690    $a 0984
690    $a 0800
710 2  $a The University of North Carolina at Chapel Hill. $b Biostatistics. $3 1023527
773 0  $t Dissertations Abstracts International $g 82-01B.
790    $a 0153
791    $a Ph.D.
792    $a 2020
793    $a English
856 40 $u https://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27959806
Holdings:
Barcode: W9428581
Location: Electronic Resources
Circulation category: 11. Online Reading
Material type: E-book
Call number: EB
Use type: Normal
Loan status: On shelf
Holds: 0