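A Simulation-Based Comparison of Different Missing-Value Handling Techniques
Topic: missing values. Focus: simulation techniques. Source: Zhengzhou University, 2012 master's thesis. Document type: degree thesis.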
[Abstract]:
Objective: Missing data are common in research on traditional Chinese medicine (TCM) syndromes of AIDS. They complicate the analysis and cause a range of problems, including biased results. Finding an imputation method suited to this database is therefore an urgent task before any data analysis. Based on field-survey data on TCM syndromes, this study uses data simulation to compare the strengths and weaknesses of different handling methods, discuss where each applies, determine the optimal number of imputations for the multiple imputation (MI) method, and identify the most accurate, efficient, and convenient method under different missing-data patterns and mechanisms.

Methods: Complete data sets and data sets with different missing rates were simulated in SAS 9.1. For continuous variables missing completely at random (MCAR) or missing at random (MAR), expectation maximization (EM), regression imputation, mean imputation, listwise deletion, and multiple imputation (MI) were applied, and the precision, accuracy, and mean obtained with each method were compared. For binary variables, listwise deletion and logistic regression within MI were applied, and the regression coefficients and standard errors obtained with each method were compared.

Results:
1. Continuous variables: all data in this study follow an arbitrary missing pattern. Imputation efficiency rose as the number of imputations increased and exceeded 0.95 in every case at 10 imputations. Precision also rose with the number of imputations and was highest after 10 imputations. As for accuracy, with less than 20% missing, a small number of imputations (3-5) was enough to reach high accuracy; with 30-40% missing, MI with 10 imputations gave relatively high accuracy; with more than 50% missing, accuracy was unstable.
2. MCAR mechanism: with less than 10% missing, every method reproduced the mean of the complete data set, and MI had the highest precision and accuracy. With more than 20% missing, listwise deletion and MI outperformed the other methods; MI gave higher precision and listwise deletion gave higher accuracy.
3. MAR mechanism: with little missing data (10%-20%), MI gave higher accuracy and precision than the other methods. With 30% missing, listwise deletion gave high accuracy but poor precision. With more missing data (missing rate 40%), no method imputed well.
4. Binary variables: with little missing data (missing rate below 40%), listwise deletion was simple, accurate, and efficient, whereas MI required a more complex program and considerable memory and time for repeated imputation, and its results were worse than those of listwise deletion. With 40%-50% missing, MI with logistic regression achieved good results with only a few imputations (2). With more than 60% missing, neither method performed well.

Conclusion: For large-sample continuous data, which can be treated as normally distributed, the tolerable proportion of missing values is below 30%. Traditional methods such as mean imputation and listwise deletion are simple and convenient and have certain advantages. MI, however, handles a broader range of commonly encountered problems, has more room to show its strengths, makes it practical to impute most types of missing values, and achieves relatively high imputation efficiency.
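The kind of simulation the abstract describes can be illustrated with a minimal sketch. It is not the thesis's SAS 9.1 program: the data-generating model, sample size, seed, and function names below are hypothetical, and only three of the compared methods are shown (listwise deletion, mean imputation, and a deterministic regression imputation). The `relative_efficiency` helper uses Rubin's formula, 1 / (1 + gamma / m), where gamma is the fraction of missing information and m is the number of imputations; the abstract does not say how its efficiency figure was computed, so treating it as this quantity is an assumption.

```python
import numpy as np


def relative_efficiency(gamma: float, m: int) -> float:
    """Rubin's relative efficiency of m imputations compared with infinitely many.

    gamma is the fraction of missing information (between 0 and 1).
    """
    return 1.0 / (1.0 + gamma / m)


def simulate_mcar(n=2000, missing_rate=0.3, seed=0):
    """Simulate a complete data set, then delete y values completely at random.

    The linear model for y is a hypothetical choice made for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 2.0 + 0.8 * x + rng.normal(scale=0.6, size=n)
    # MCAR: the chance of being missing does not depend on x or y.
    missing = rng.random(n) < missing_rate
    y_obs = np.where(missing, np.nan, y)
    return x, y, y_obs


def compare_methods(x, y_obs):
    """Return {method name: (mean, standard deviation)} of y after each treatment."""
    observed = ~np.isnan(y_obs)

    # Listwise deletion: keep only the cases where y was observed.
    y_deleted = y_obs[observed]

    # Mean imputation: fill every gap with the observed mean.
    y_mean = np.where(observed, y_obs, y_deleted.mean())

    # Regression imputation: fill gaps with predictions from y ~ x
    # fitted on the observed cases (no random residual is added).
    slope, intercept = np.polyfit(x[observed], y_obs[observed], 1)
    y_reg = np.where(observed, y_obs, intercept + slope * x)

    treated = {
        "listwise deletion": y_deleted,
        "mean imputation": y_mean,
        "regression imputation": y_reg,
    }
    return {name: (v.mean(), v.std(ddof=1)) for name, v in treated.items()}


if __name__ == "__main__":
    x, y, y_obs = simulate_mcar()
    print("complete data:", (y.mean(), y.std(ddof=1)))
    for name, stats in compare_methods(x, y_obs).items():
        print(name, stats)
    print("RE at gamma=0.5, m=10:", relative_efficiency(0.5, 10))
```

In this sketch, mean imputation reproduces the listwise-deletion mean but shrinks the standard deviation, which is one reason a comparison needs to look at more than the mean alone. Under Rubin's formula, 10 imputations keep the relative efficiency above 0.95 whenever the fraction of missing information is at most about 0.5.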
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: R181.3
Article ID: 1557582
Article link: http://www.sikaile.net/yixuelunwen/yufangyixuelunwen/1557582.html