計(jì)算機(jī)輔助醫(yī)學(xué)影像診斷中的關(guān)鍵學(xué)習(xí)技術(shù)研究
Published: 2018-09-17 16:42
【Abstract】: Computer-aided diagnosis (CAD), the use of computer techniques to assist radiologists in case diagnosis, plays an increasingly important role in early breast-cancer screening and can effectively help reduce the mortality of breast-cancer patients. Clinically, labeled case samples are difficult to collect, and negative cases far outnumber positive ones, so CAD applications face small-sample, imbalanced learning problems: problems of learning performance on data sets with severely asymmetric classes and insufficiently expressed information. Such problems matter in many real applications, and although classical machine learning and data mining have achieved great success in practice, learning from small-sample and imbalanced data remains a major challenge. This dissertation systematically analyzes the main reasons why machine-learning performance degrades under small-sample and imbalanced conditions and reviews the current effective methods for these problems. Building on the observation that common under-sampling methods easily discard class information when handling imbalanced samples, it focuses on how to treat imbalanced data reasonably and effectively: two new under-sampling methods are proposed that extract the samples richest in class information, countering the information loss under-sampling causes. For the small-sample problem, a new class-labeling algorithm is proposed that enlarges the training set by automatically labeling unlabeled samples while effectively reducing the labeling errors such a process is prone to.

The dissertation centers on learning techniques for small-sample, imbalanced data, in particular the resampling of imbalanced data sets and the class labeling of unlabeled samples. Its main contributions are:

(1) To address the small-sample learning problem caused by the difficulty of collecting labeled case samples in CAD applications, the dissertation uses the abundant unlabeled samples to enlarge the training set. The labeling process, however, often produces wrong class labels, and mislabeled samples act as noise that markedly degrades learning performance. For this mislabeling problem in semi-supervised learning, a Hybrid Class Labeling algorithm is proposed that labels samples from three distinct perspectives: geometric distance, probability distribution, and semantic concept. The three labeling methods rest on different principles and differ significantly; only unlabeled samples on which all three agree are added to the training set. To further reduce the influence of any remaining mislabeled samples, the algorithm introduces a pseudo-label membership degree into SVM (Support Vector Machine) learning, which controls each sample's contribution to training. Experiments on the UCI Breast-cancer data set show that the algorithm effectively addresses small-sample learning; compared with any single labeling technique, it produces fewer mislabeled samples and achieves significantly better learning performance than the other algorithms.

(2) To address the loss of useful class information common in under-sampling, a new under-sampling method based on the convex hull (CH) structure is proposed. The convex hull of a data set is the smallest convex set containing all its samples; every sample lies inside the polygon or polytope spanned by the hull vertices. Inspired by this geometric property, the algorithm computes the convex hull of the majority class and replaces the majority training samples with the compact set of hull vertices to balance the data set. In practice, however, the two classes often overlap, and so do their hulls; representing the majority-class boundary with the full hull then risks over-fitting and weakens generalization. Considering that the reduced convex hull (RCH) and the scaled convex hull (SCH) both lose boundary information while shrinking the hull, we propose the hierarchically reduced convex hull (Hierarchy Reduced Convex Hull, HRCH). Motivated by the significant structural differences and complementarity of RCH and SCH, HRCH fuses the two. Compared with other shrunken-hull structures, HRCH carries more diverse, complementary class information and loses less of it during reduction. The algorithm samples the majority class with different values of the reduction and scaling factors; each resulting HRCH is combined with the rare-class samples to form a training set, a learner is trained on each, and the final classifier is produced by ensemble learning. In experimental comparison with four reference algorithms, the method shows better classification performance and robustness.

(3) Also addressing the class-information loss of under-sampling, the dissertation proposes RKNN, a new under-sampling method based on reverse k-nearest neighbors. Unlike the widely used k-nearest neighbors, reverse k-nearest neighbors examine a neighborhood from a global perspective: a point's reverse k-nearest neighbors depend not only on its nearby points but also on all remaining points in the data set. Any change in the data distribution changes every point's reverse-nearest-neighbor relations, so these relations reflect the complete distribution structure of the sample set, and the way they propagate adjacency overcomes the limitation of plain nearest-neighbor queries, which see only the query point's local distribution. On the majority class, the algorithm uses reverse k-nearest neighbors to remove noise, unstable boundary samples, and redundant samples, retaining the most informative and reliable samples for training; it balances the training samples while effectively alleviating the class-information loss caused by under-sampling. Experiments on the UCI Breast-cancer data set verify its effectiveness on imbalanced learning; compared with k-nearest-neighbor-based under-sampling, RKNN achieves better performance.
[Abstract]: Computer aided diagnosis (CAD) plays an increasingly important role in early breast cancer screening and can effectively help reduce the mortality of breast cancer patients. Because labeled case samples are difficult to collect clinically and negative cases far outnumber positive ones, CAD applications face small-sample, imbalanced learning problems. Such problems concern learning performance on data sets with severe class asymmetry and insufficient information representation. Although machine learning and data mining have achieved great success in many practical applications, learning from small samples and imbalanced data is still a great challenge. This paper systematically expounds the main reasons for the performance degradation of machine learning in small-sample and imbalanced learning environments, and reviews the current effective methods for these problems. Building on a thorough understanding of how under-sampling methods tend to lose class information when dealing with imbalanced samples, the paper focuses on how to handle imbalanced data reasonably and effectively: two new under-sampling methods are proposed to extract the samples richest in class information, countering the information loss that under-sampling causes. In addition, for the small-sample problem, a new class labeling algorithm is proposed that enlarges the training sample set by automatically labeling unlabeled samples while effectively reducing labeling errors.
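The under-sampling concern above can be seen in the plainest baseline, random under-sampling: the majority class is thinned by discarding samples at random, which balances the classes but may throw away informative points. A minimal sketch (function and variable names are my own, not the thesis's):

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Balance a two-class set by randomly discarding majority samples.

    This is the common baseline whose weakness motivates the thesis:
    the discarded samples may carry class information that is lost.
    """
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=mino.size, replace=False)
    sel = np.concatenate([keep, mino])
    return X[sel], y[sel]

# 100 negative (majority) vs 10 positive (minority) samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 100 + [1] * 10)
Xb, yb = random_undersample(X, y, majority_label=0, seed=0)
print(np.bincount(yb))  # 10 samples of each class remain
```

The two under-sampling methods summarized in points (2) and (3) below replace the random choice of `keep` with structure-aware selections.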
This dissertation focuses on learning techniques for small-sample, imbalanced data, centering on the resampling of imbalanced data sets and the class labeling of unlabeled samples. Its main work includes:
(1) To address the small-sample learning problem caused by the difficulty of collecting labeled case samples in CAD applications, this paper uses the large number of available unlabeled samples to expand the training sample set. The labeling process, however, often produces wrong class labels, and mislabeled samples act as noise that markedly degrades learning performance. To solve this mislabeling problem in semi-supervised learning, a Hybrid Class Labeling algorithm is proposed. The algorithm labels samples from three different perspectives: geometric distance, probability distribution, and semantic concepts. The three methods rest on different principles and differ significantly; only unlabeled samples on which all three agree are added to the training sample set. To further reduce the possible adverse effects of remaining mislabeled samples on the learning process, a pseudo-label membership degree is introduced into SVM (Support Vector Machine) learning, and each sample's contribution to training is controlled by its membership. Experimental results on the UCI Breast-cancer data set show that the algorithm can effectively solve the small-sample learning problem. Compared with any single class-labeling technique, the algorithm produces fewer mislabeled samples and achieves significantly better learning performance than the other algorithms.
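The membership-weighted SVM idea can be sketched with per-sample weights, which scikit-learn's `SVC.fit` accepts via `sample_weight`. This is only an illustration of the weighting mechanism: the membership value 0.4 is an arbitrary assumption here, whereas the thesis derives memberships from the agreement of its three labeling views.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hand-labeled seed set plus pseudo-labeled samples (synthetic two-class data).
X_lab = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y_lab = np.array([0] * 20 + [1] * 20)
X_pse = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y_pse = np.array([0] * 30 + [1] * 30)

X = np.vstack([X_lab, X_pse])
y = np.concatenate([y_lab, y_pse])
# Membership 1.0 for hand-labeled samples, a lower value (0.4 is an
# illustrative choice) for pseudo-labeled ones, so a possibly mislabeled
# sample pulls the decision boundary less.
w = np.concatenate([np.ones(len(y_lab)), np.full(len(y_pse), 0.4)])

clf = SVC(kernel="rbf").fit(X, y, sample_weight=w)
print(clf.score(X_lab, y_lab))
```

Down-weighting pseudo-labeled points bounds the damage any single mislabeled sample can do to the margin, while still letting the enlarged training set shape the classifier.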
(2) To address the problem that under-sampling often loses valid class information, a new under-sampling method based on the Convex Hull (CH) structure is proposed. The convex hull of a data set is the smallest convex set containing all of its samples; every point lies inside the polygon or polytope spanned by the hull vertices. Inspired by this geometric property, the algorithm computes the convex hull of the majority class and replaces the majority training samples with the compact set of hull vertices to balance the sample set. In practice the two classes often overlap, and so do their hulls; representing the majority-class boundary with the full hull then risks over-fitting and weakens generalization. Considering the boundary-information loss caused by the reduced convex hull (RCH) and the scaled convex hull (SCH) during hull reduction, we propose the hierarchically reduced convex hull (HRCH). Inspired by the significant structural differences and complementarity of RCH and SCH, we fuse the two to generate the HRCH structure. Compared with other reduced hull structures, HRCH contains more diverse and complementary class information, effectively reducing the loss of class information during hull reduction. The algorithm samples the majority class with different values of the reduction and scaling factors; each resulting HRCH is combined with the rare-class samples to form a training set, multiple learners are trained on these sets, and the final classifier is produced by ensemble learning. In experimental comparison with four reference algorithms, the method shows better classification performance and robustness.
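The plain convex-hull step, replacing the majority class by its hull vertices, can be sketched with `scipy.spatial.ConvexHull`. Note this shows only the basic CH idea; the thesis's HRCH additionally fuses reduced (RCH) and scaled (SCH) hulls at several levels, which is not reproduced here.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_undersample(X_major, X_minor):
    """Replace the majority class by its convex-hull vertices (plain CH)."""
    hull = ConvexHull(X_major)          # hull.vertices indexes the extreme points
    X_keep = X_major[hull.vertices]
    X = np.vstack([X_keep, X_minor])
    y = np.array([0] * len(X_keep) + [1] * len(X_minor))
    return X, y

rng = np.random.default_rng(1)
X_major = rng.normal(0, 1, (500, 2))    # majority (negative) class
X_minor = rng.normal(3, 1, (25, 2))     # rare (positive) class
X, y = hull_undersample(X_major, X_minor)
print((y == 0).sum(), "majority samples kept of 500")
```

In 2-D the hull of hundreds of points typically has only a handful of vertices, which is exactly why the full hull alone is too crude a summary when the classes overlap, motivating the reduced and scaled variants the thesis builds on.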
(3) Also aiming at the loss of class information in under-sampling, this paper proposes RKNN, a new under-sampling method based on reverse k-nearest neighbors. Unlike the widely used k-nearest neighbors, reverse k-nearest neighbors examine a neighborhood from a global perspective: a point's reverse k-nearest neighbors depend not only on its nearby points but also on all remaining points in the data set. Any change in the data distribution changes every point's reverse-nearest-neighbor relations, so these relations reflect the complete distribution structure of the sample set and overcome the limitation of plain nearest-neighbor queries, which see only the query point's local distribution. On the majority class, the algorithm uses reverse k-nearest neighbors to remove noise, unstable boundary samples, and redundant samples, retaining the most informative and reliable samples for training; it balances the training samples while effectively alleviating the class-information loss caused by under-sampling. Experimental results on the UCI Breast-cancer data set verify the algorithm's effectiveness on imbalanced learning; compared with k-nearest-neighbor-based under-sampling, RKNN achieves better performance.
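The reverse k-NN relation above can be computed by inverting an ordinary k-NN query: count, for each point, how many other points list it among their k nearest neighbors. The pruning comment at the end is my reading of the method; the thesis's exact selection criterion is not given in this abstract.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reverse_knn_counts(X, k=5):
    """For each point, count how many other points list it among their k-NN.

    Unlike a plain k-NN query, this count depends on the whole data
    distribution: moving any point can change every point's count.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)           # column 0 is each point itself
    counts = np.zeros(len(X), dtype=int)
    for neighbors in idx[:, 1:]:
        counts[neighbors] += 1
    return counts

rng = np.random.default_rng(2)
X_major = rng.normal(0, 1, (200, 2))
counts = reverse_knn_counts(X_major, k=5)
# Points with very few reverse neighbors sit in sparse, unstable regions
# (noise/boundary candidates); thresholding these counts is one plausible
# pruning rule for the majority class (assumption, not the thesis's rule).
print(counts.min(), counts.max(), counts.sum())
```

Since every point contributes exactly k entries, the counts always sum to k times the sample size; what varies is their distribution, and it is this global structure that RKNN exploits.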
【Degree-granting institution】: Zhejiang University
【Degree level】: Doctoral
【Year of conferral】: 2014
【CLC number】: R81-39
Article No.: 2246516
Link: http://www.sikaile.net/yixuelunwen/yundongyixue/2246516.html