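Research on Relief Feature Selection and Mixed-Kernel SVM in Disease Diagnosis
Published: 2018-01-19 16:07
Keywords: Relief feature selection; SVM; combinatorial optimization; mixed kernel function. Source: Taiyuan University of Technology, master's thesis, 2017. Document type: degree thesis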
[Abstract]: Medical diagnosis is the process in which a doctor examines a patient and classifies and differentiates the cause and pathogenesis of the patient's disease, as the basis for drawing up a treatment plan. It is essentially a classification process, also called pattern recognition. Existing classification methods include the support vector machine (SVM), K-nearest neighbor (KNN), neural network (NN), and decision tree algorithms. SVM is robust on pattern recognition problems involving small samples, nonlinearity, and high-dimensional data, and it shows good recognition ability and adaptability. When an SVM builds a classification model, its learning ability on the training samples and its generalization performance on the test data are determined mainly by three factors: how the original data set is processed, which kernel function is chosen, and how the kernel function's parameters are set.

SVM classification currently has two main problems. (1) SVMs use a single kernel function, which is either a global kernel or a local kernel. A global kernel generalizes well but learns weakly, while a local kernel learns strongly but generalizes weakly, so the classification result often cannot achieve high learning ability and high generalization performance at the same time. (2) SVM parameters are selected mainly by one of two methods: traditional grid search or a heuristic algorithm. Grid search reliably finds the best solution on the grid, but it is time-consuming and inefficient. Heuristic algorithms search quickly, but their solutions are less precise than those of grid search, and a genetic algorithm reaches the optimal solution only with some probability.

To improve the classification performance of SVM, this thesis studies the following aspects. (1) The Relief algorithm is used for feature selection. In disease diagnosis, the clinical features that a patient presents differ in how strongly they correlate with the disease, and a doctor cannot quantify the degree of association between each feature and the disease. To diagnose more accurately, a feature selection algorithm is therefore needed to compute the weight of each feature, that is, the degree of association between each clinical symptom and the disease. (2) A global kernel and a local kernel are combined linearly to construct a mixed kernel function with improved learning ability and generalization performance. (3) The kernel parameters are optimized in two stages: a genetic algorithm, one of the heuristic algorithms, first locates the approximate range of the optimal solution quickly, and grid search then performs a second, precise search within that small range. This greatly reduces the computing time of grid search and also finds a better solution than the genetic algorithm alone.

The models are built with Matlab R2015b and the LIBSVM toolkit developed by Professor Chih-Jen Lin of Taiwan. The thesis covers the Matlab development environment, the interface configuration of the LIBSVM toolkit, how to set the kernel function and its parameters, how to construct the mixed kernel function, and how to carry out the combined parameter optimization. The Heart disease and Breast cancer data sets from the public UCI repository serve as the application background for building and validating the disease diagnosis models.
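The linear combination of a global and a local kernel described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's own implementation, which was written in Matlab with LIBSVM. The abstract does not say which kernels were combined, so the polynomial kernel (global) and the Gaussian RBF kernel (local) are assumptions, as are the function names, the mixing weight `lam`, and all default parameter values. Restricting `lam` to [0, 1] keeps the result a valid kernel, because a non-negative weighted sum of positive semi-definite kernels is itself positive semi-definite.

```python
import math


def poly_kernel(x, z, degree=2, coef0=1.0):
    """Global kernel (assumed): polynomial, (x . z + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Local kernel (assumed): Gaussian RBF, exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def mixed_kernel(x, z, lam=0.5, degree=2, coef0=1.0, gamma=0.5):
    """Mixed kernel: lam * K_poly + (1 - lam) * K_rbf, with 0 <= lam <= 1.

    lam is a hypothetical mixing weight; lam = 1 gives the pure global
    kernel and lam = 0 gives the pure local kernel.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    global_part = poly_kernel(x, z, degree, coef0)
    local_part = rbf_kernel(x, z, gamma)
    return lam * global_part + (1.0 - lam) * local_part


def gram_matrix(samples, **kernel_params):
    """Kernel (Gram) matrix of the mixed kernel over a list of samples."""
    return [
        [mixed_kernel(xi, xj, **kernel_params) for xj in samples]
        for xi in samples
    ]


if __name__ == "__main__":
    # Three toy two-dimensional samples, invented for illustration only.
    X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    for row in gram_matrix(X, lam=0.3, gamma=1.0):
        print([round(v, 4) for v in row])
```

A Gram matrix of this kind is the form that an SVM solver with precomputed-kernel support consumes; LIBSVM offers this through its `-t 4` option. In such a setup the mixing weight `lam` becomes one more parameter to tune alongside the kernel parameters and the SVM penalty.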
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: R44; TP18
Article ID: 1444879
Article link: http://www.sikaile.net/linchuangyixuelunwen/1444879.html