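An Improved Random Forest Algorithm Based on Spark
Topics: random forest + classification accuracy; Source: Taiyuan University of Technology, master's thesis, 2017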
[Abstract]: Random forest is a machine learning algorithm with strong classification performance. It scales to large datasets, can handle datasets with several thousand attributes, requires few tuning parameters, and is resistant to overfitting. As a result, it has been widely applied across many fields and has attracted many researchers to study and improve it, with fruitful results. However, the traditional random forest algorithm has two weaknesses when it generates the forest model: first, the decision trees it produces vary in classification performance; second, the decision trees are correlated with one another. Trees that classify poorly, and trees that are strongly correlated with each other, degrade the overall classification performance of the forest. To address these two issues, this thesis proposes an improved random forest algorithm based on classification accuracy and similarity. The algorithm uses the AUC metric to evaluate each decision tree in the forest and selects the trees whose AUC is above a set threshold. It then computes the pairwise similarity of the selected trees to obtain a similarity matrix. Because highly similar trees are also highly correlated, the trees are clustered according to the similarity matrix and a similarity criterion. Finally, the tree with the highest AUC in each cluster is chosen as that cluster's representative, and these representatives form the new random forest model.
Tests on UCI datasets including heart disease, breast cancer, Pima Indians diabetes, and Indian liver disease show that the improved algorithm achieves somewhat higher classification accuracy than the traditional random forest algorithm. The improved algorithm was first implemented on the MATLAB platform and compared with the traditional algorithm on these four UCI datasets. Because the improved algorithm adds two optimization steps, its classification speed is lower than that of the traditional algorithm, and processing and iteration on larger datasets are very slow on single-machine MATLAB. The improved algorithm was therefore also implemented on the Spark platform, which substantially increased its classification speed.
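The selection procedure described in the abstract (AUC filtering, similarity matrix, clustering, one representative per cluster) can be sketched as follows. This is a minimal single-machine illustration, not the thesis's implementation: the abstract does not specify the similarity measure or the clustering rule, so the prediction-agreement similarity (`agreement`), the greedy clustering, the 0.5 decision cutoff, and the parameter names `auc_min` and `sim_min` are all assumptions made for this example.

```python
from itertools import combinations


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def agreement(a, b):
    """Assumed similarity measure: fraction of samples on which two trees agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def prune_forest(tree_scores, labels, auc_min=0.7, sim_min=0.9):
    """Return the indices of the trees kept in the pruned forest.

    tree_scores[i][j] is tree i's positive-class score for validation sample j.
    """
    # Step 1: keep only trees whose validation AUC reaches the threshold.
    aucs = {i: auc(s, labels) for i, s in enumerate(tree_scores)}
    kept = [i for i in aucs if aucs[i] >= auc_min]

    # Step 2: pairwise similarity over hard predictions (assumed 0.5 cutoff).
    preds = {i: [int(s >= 0.5) for s in tree_scores[i]] for i in kept}
    sim = {(i, j): agreement(preds[i], preds[j]) for i, j in combinations(kept, 2)}

    # Step 3: greedy clustering (an assumption). Trees are visited from highest
    # to lowest AUC and join the first cluster whose representative is similar.
    clusters = []
    for i in sorted(kept, key=lambda t: -aucs[t]):
        for cluster in clusters:
            rep = cluster[0]
            if sim[(min(i, rep), max(i, rep))] >= sim_min:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Step 4: the first member of each cluster has its highest AUC by construction.
    return sorted(cluster[0] for cluster in clusters)


if __name__ == "__main__":
    # Toy validation data: four trees scored on six samples.
    labels = [1, 1, 1, 0, 0, 0]
    tree_scores = [
        [0.9, 0.8, 0.7, 0.2, 0.3, 0.1],  # strong tree
        [0.9, 0.8, 0.4, 0.2, 0.3, 0.1],  # strong tree, close to tree 0
        [0.6, 0.4, 0.9, 0.7, 0.2, 0.1],  # weaker but different tree
        [0.3, 0.6, 0.2, 0.8, 0.7, 0.4],  # poor tree, removed by the AUC filter
    ]
    print(prune_forest(tree_scores, labels, auc_min=0.7, sim_min=0.8))  # [0, 2]
```

In this toy run, tree 3 is dropped by the AUC filter, trees 0 and 1 fall into one cluster because they agree on five of six samples, and tree 2 forms its own cluster, so the pruned forest keeps trees 0 and 2. The final classification would then be a vote among the kept trees, as in an ordinary random forest.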
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181
Article ID: 1865699
Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/1865699.html