Research on Decision Tree Classification Methods for Discrete Attributes
Published: 2018-05-21 05:16
Topic: data mining + decision tree; Source: Dalian Maritime University, 2017 master's thesis
[Abstract]: Data mining is the process of discovering regularities in large volumes of existing data. In recent years, the intelligent extraction of knowledge from massive data has attracted wide attention in industry. The field of data mining encompasses classification, clustering, association analysis, and other mining methods. Because it extracts knowledge in a way that is simple, efficient, and easy to understand, the decision tree algorithm occupies an irreplaceable position in data mining. In existing decision tree algorithms, the criteria for choosing split nodes are mostly based on Shannon's information entropy, which requires repeated logarithmic computation, so classification efficiency is low. Moreover, because existing algorithms choose randomly among candidate nodes, the classifier cannot further discriminate between attributes whose split-criterion values are equal, which lowers prediction accuracy. Addressing these shortcomings, this thesis proposes the following improvements. (1) To address the low classification efficiency of existing decision tree algorithms, an improved optimization function for the attribute-selection criterion is proposed that avoids complex logarithmic computation and raises CPU utilization. Comparative experiments show that the optimized function effectively improves classification efficiency and CPU utilization. (2) To address the low precision of the generated decision tree classifier, a heap-based attribute-selection method is introduced: rather than randomly choosing the next split node when the criterion values of two or more attributes are equal or fall within a threshold of one another, the method resolves the tie explicitly, improving classification precision. Experiments confirm that the method effectively improves classification precision on certain data sets. (3) To further address low classification precision and overfitting, a method based on classification rules is introduced: the improved decision tree algorithm is run on N random samples to generate N decision tree classifiers, the best classification rules are selected from these classifiers, and the final decision tree model is built from them. Experiments confirm that, compared with existing algorithms, the proposed algorithm improves both classification efficiency and classification accuracy.
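The abstract specifies neither the exact form of the optimized split criterion nor the details of the heap-based tie-break, so the following is only a minimal Python sketch of the two ideas under stated assumptions: Gini impurity stands in here for the thesis's unspecified log-free optimization function, and the names (`gini_impurity`, `split_score`, `choose_attribute`), the tolerance `eps`, and the lexicographic tie-break are hypothetical illustration choices, not the author's method.

```python
import heapq
from collections import Counter

def gini_impurity(labels):
    # Log-free impurity, 1 - sum(p_i^2): avoids the repeated logarithms
    # of information entropy (stand-in for the thesis's criterion).
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(rows, labels, attr):
    # Weighted impurity after splitting on a discrete attribute (lower is better).
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    n = len(labels)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())

def choose_attribute(rows, labels, attrs, eps=1e-3):
    # Score every candidate attribute and heapify, so the best-scoring
    # attribute sits at the heap root.
    heap = [(split_score(rows, labels, a), a) for a in attrs]
    heapq.heapify(heap)
    best_score, _ = heap[0]
    # Instead of picking at random among near-equal candidates, gather every
    # attribute within eps of the best and break the tie deterministically.
    tied = [a for s, a in heap if s - best_score <= eps]
    return min(tied)

# Toy usage: "windy" splits the labels perfectly (score 0), so it is chosen.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain",  "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "stay", "stay"]
print(choose_attribute(rows, labels, ["outlook", "windy"]))  # -> windy
```

The deterministic secondary rule is the key point of technique (2): any fixed rule (attribute index, a second criterion, etc.) removes the randomness that the abstract identifies as a source of lost accuracy.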
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP311.13