Research on Random Forest Classification Algorithms Based on the Spark Distributed Platform

Published: 2018-03-25 23:15

Topic: high-dimensional big data　Focus: classification　Source: Civil Aviation University of China, master's thesis, 2017

[Abstract]: The rapid development of information technology and networks has produced large volumes of high-dimensional, complex data, and classifying these data effectively to extract valuable information is a problem of major significance. Random forest is an important classification algorithm that tolerates noise and outliers well and lends itself to parallelization. The original random forest classification algorithm and its improved variants mostly run on a single machine; when they face large volumes of high-dimensional, complex data, neither their running time nor their memory footprint can meet practical demands. Spark is an efficient distributed computing framework that offers parallel computation combining performance and speed, and it is an effective way to address this problem. In addition, many features of high-dimensional data carry little information and are only weakly correlated with the class label, which lowers the classification accuracy of random forests. This thesis therefore improves the random forest algorithm on the Spark platform so that high-dimensional data can be classified more effectively in the big data era.

First, when the random forest algorithm ensembles its decision trees and makes a classification decision, it cannot treat the individual trees differently, so trees with weak classification ability degrade the overall classification performance. To address this, a weighted-tree random forest algorithm is proposed and implemented on the Spark platform. The algorithm uses a weighted-tree ensemble strategy that strengthens the influence of trees with strong classification ability on the classification decision and weakens the influence of trees with weak classification ability, which improves the classification ability of the forest as a whole. Experimental results show that, compared with the original random forest algorithm, the proposed algorithm achieves higher classification accuracy, scales well, and classifies high-dimensional big data effectively.

Second, when the random forest algorithm generates a feature subspace at a node, simple random sampling often yields a subspace that contains many features with weak classification ability, which hurts classification performance. To address this, the way the stratified subspace is implemented is improved, and a stratified-subspace random forest algorithm is proposed and implemented on the Spark platform. The improved implementation both preserves the correctness of the feature stratification and reduces the computational cost, which makes it suitable for high-dimensional big data. Experimental results confirm that the proposed algorithm classifies high-dimensional big data effectively. Compared with the original random forest algorithm, it achieves higher classification accuracy and better generalization, and it scales well.

Finally, the weighted-tree random forest algorithm and the stratified-subspace random forest algorithm are applied to flight delay prediction. After the detailed information on the dataset's features is analyzed, the data are preprocessed with min-max normalization and delay-level partitioning. Experiments verify that both algorithms can effectively classify and predict the delay level of flight delays.
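The abstract does not give the formulas behind the two proposed algorithms, so the following is only a minimal single-machine sketch of the two ideas, not the thesis's Spark implementation. It assumes that each tree's weight is its accuracy on held-out samples, and that features are split into a "strong" and a "weak" stratum by a precomputed informativeness score. The function names `weighted_vote` and `stratified_subspace`, the `strong_fraction` parameter, and the half-and-half stratum split are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def weighted_vote(tree_predictions, tree_weights):
    """Return the class label with the largest total weight.

    tree_predictions: one predicted label per tree.
    tree_weights: one non-negative weight per tree. Assumed here to be
    each tree's accuracy on held-out samples; the thesis may differ.
    """
    if len(tree_predictions) != len(tree_weights):
        raise ValueError("need exactly one weight per tree")
    totals = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] += weight
    # Break ties deterministically: the smallest label wins.
    return max(sorted(totals), key=lambda label: totals[label])


def stratified_subspace(scores, subspace_size, rng, strong_fraction=0.5):
    """Sample feature indices from a strong and a weak stratum.

    scores: one informativeness score per feature (hypothetical input,
    for example an information-gain value computed beforehand).
    subspace_size: number of features to draw for one tree node.
    rng: a random.Random instance, passed in for reproducibility.
    """
    if subspace_size < 1:
        raise ValueError("subspace_size must be at least 1")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    cut = max(1, len(ranked) // 2)  # assumption: top half is "strong"
    strong, weak = ranked[:cut], ranked[cut:]
    n_strong = min(len(strong), max(1, round(subspace_size * strong_fraction)))
    n_weak = min(len(weak), subspace_size - n_strong)
    # Every subspace gets features from both strata, unlike simple
    # random sampling, which may draw only weak features.
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


if __name__ == "__main__":
    # One accurate tree outvotes two weak trees that agree with each other.
    print(weighted_vote(["delayed", "on_time", "on_time"], [0.9, 0.4, 0.4]))
    print(stratified_subspace([0.9, 0.8, 0.1, 0.2, 0.7, 0.05], 4, random.Random(0)))
```

With equal weights, `weighted_vote` reduces to ordinary majority voting, which is the behaviour of the original random forest that the weighted-tree strategy is meant to improve on.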
[Degree-granting institution]: Civil Aviation University of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181

Article ID: 1665285

Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/1665285.html
