
Research on Ensemble Learning Algorithms Based on Feature Extraction

Published: 2018-07-04 10:42

  Topics: ensemble learning + feature extraction; Source: Shandong Normal University, master's thesis, 2017


[Abstract]: Improving the generalization ability of learning systems has long been a central concern of machine learning research. A single classifier has unavoidable limitations, so its classification performance eventually hits a bottleneck. Ensemble learning, a newer machine learning paradigm, applies several individual classifiers to the same problem and combines their outputs according to some rule, so that the final classification is decided jointly by all learners. Because the classifiers complement one another, ensemble learning greatly improves the generalization ability and classification performance of the system, and it is widely used in biomedicine, information science, and other fields. As Internet technology penetrates every area of social life, the data to be processed have become increasingly complex: imbalanced, high-dimensional, and noisy data are all common. Traditional ensemble learning methods perform well on well-formed data but have limited effect on such complex data, so it is important to incorporate data processing methods into ensemble learning. Feature extraction is one of the main tools of data analysis and processing and is widely used for dimensionality reduction and for removing noise and redundancy. Building on an in-depth study of ensemble learning algorithms, this thesis combines feature extraction and other data processing algorithms with ensemble learning and proposes improved ensemble learning algorithms, as follows. Imbalanced data usually cause a classifier to perform poorly on minority-class samples. To reduce the imbalance ratio of a data set, the SMOTE oversampling algorithm can be used to preprocess the data. This thesis uses independent component analysis (ICA) to remove noise from the data and then applies SMOTE to balance it, so that the processed data are better suited to ensemble learning algorithms.
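The SMOTE oversampling step mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `smote` and `balance`, the default `k=3`, the two-class assumption, and the toy data are all assumptions made for illustration. The ICA denoising step is not shown; it would be applied to the features before oversampling.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (hypothetical helper, plain-Python sketch)."""
    if n_new <= 0:
        return []
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Take a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class until it matches the size of the rest
    of the data (assumes a two-class problem)."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy data: six majority samples (label 0) and two minority samples (label 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.4, 0.0],
     [5.0, 5.0], [5.5, 5.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # 6 6
```

The balanced set could then be passed to any Bagging implementation; every synthetic point lies on a line segment between two real minority samples, which is the defining property of SMOTE.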
Experimental results show that the proposed method significantly improves the classification performance of the Bagging ensemble learning algorithm on imbalanced data. Every type of data has its own organization and structural information, and its attributes are related to one another. Analysis shows that the feature attributes of spam web page data sets are both high-dimensional and highly correlated. To address the high dimensionality of, and the correlation between, the content features and link features of spam pages, this thesis studies these feature attributes in depth and applies principal component analysis (PCA) separately to each group of correlated attributes, rather than to the whole attribute set at once. This reduces the dimensionality while preserving, to some extent, the original attribute structure of the data set. Experimental results show that the proposed method performs well on spam web page classification.
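The idea of running PCA on each group of correlated attributes, instead of on all attributes at once, can be sketched as follows. This is an illustration under stated assumptions, not the thesis's procedure: the grouping of columns is supplied by hand, only the first principal component of each group is kept, and that component is found by simple power iteration. The names `top_component` and `grouped_pca` and the toy data are hypothetical.

```python
import math
import random


def top_component(rows, iters=200, seed=0):
    """Return (column means, unit direction) of the first principal
    component of `rows`, via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Random start, so it is very unlikely to be orthogonal to the top direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all columns constant: no direction of variance
            break
        v = [x / norm for x in w]
    return mean, v


def grouped_pca(X, groups):
    """Project each attribute group onto its own first principal component.

    `groups` is a list of column-index lists; the result has one column
    per group, so the group structure of the attributes is preserved."""
    columns = []
    for idx in groups:
        sub = [[row[j] for j in idx] for row in X]
        mean, v = top_component(sub)
        columns.append([sum((r[j] - mean[j]) * v[j] for j in range(len(idx)))
                        for r in sub])
    return [list(row) for row in zip(*columns)]


# Toy data: columns 0-1 stand in for one correlated group (e.g. content
# features), columns 2-3 for another (e.g. link features).
X = [[1.0, 2.0, 3.0, 1.0],
     [2.0, 4.0, 1.0, 0.0],
     [3.0, 6.0, 4.0, 2.0],
     [4.0, 8.0, 2.0, 1.0],
     [5.0, 10.0, 5.0, 3.0]]
Z = grouped_pca(X, [[0, 1], [2, 3]])
print(len(Z), len(Z[0]))  # 5 2
```

Here four attributes are reduced to two, one per group. Keeping more than one component per group would require deflation or a full eigendecomposition, which this sketch omits.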
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181


Article ID: 2095809


Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/2095809.html

