Research on Web Page Text Classification Based on Support Vector Machines

Published: 2019-06-21 09:18

[Abstract]: With the rapid development of Internet technology, the amount of web page information on the network is growing exponentially. People want to classify web pages quickly so that they can obtain valuable information effectively. Web page text classification is an important technology for fast information retrieval. It is already widely used in digital libraries, search engines, news classification, and other application areas, and has significant research value. Web page text classification builds on plain-text classification. Text is commonly represented with the vector space model, and because text vectors are high-dimensional and highly sparse, most classification algorithms suffer from the curse of dimensionality. The support vector machine (SVM) not only rests on a solid theoretical foundation but also effectively avoids the curse of dimensionality when handling high-dimensional data, and it generalizes well. SVM is therefore one of the commonly used methods for text classification and has great application value in this area. The main research work of this thesis is as follows:

1. The research background and significance of web page text classification are introduced, along with the state of text classification research in China and abroad and the current hot topics in web page text classification. The related technologies are analyzed in detail. These key technologies include web page text preprocessing, web page text representation, common feature selection methods, several evaluation criteria for text classification, and several common text classification techniques. The principles and techniques of the support vector machine are also introduced in depth.

2. An improved weight calculation method is proposed.
Because feature terms inside different tags of a web page influence classification differently, and feature terms at different positions in the body text also carry different semantic characteristics, this thesis analyzes web page features in detail and proposes a weight calculation method that weights feature terms according to HTML semantics and term position. Experiments show that processing web page text with this improved method ultimately yields relatively good classification results.

3. At present, support vector machines consume a great deal of time and memory when processing large-scale sample sets. To address this problem, this thesis studies the properties of the support vector machine and observes that the training result of an SVM depends only on the support vectors. On that basis the SVM method is improved, and a two-stage support vector machine algorithm based on fuzzy clustering is proposed. The algorithm first reduces the initial sample set with the fuzzy C-means clustering algorithm, keeping only the center points of uniform clusters and all samples in mixed clusters as the training set. If this reduced set contains enough samples, a single weighted SVM training pass is run on it and the algorithm ends. If the reduced set accounts for only a small part of the original samples, a large number of support vectors that matter for classification may have been discarded, which would greatly reduce classification accuracy. In that case, the cluster centers that lie close to the approximate optimal hyperplane obtained by the first-stage weighted SVM are de-clustered. The de-clustered samples and the mixed-cluster samples then form the training set for a second-stage standard SVM, which yields the final optimal hyperplane.
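
The sample-reduction step of the first stage can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the thesis's implementation. The function names `fuzzy_c_means` and `reduce_training_set` are hypothetical. A "uniform cluster" is read here as a cluster whose members all share one class label. Samples are assigned to the cluster with their largest membership, and each kept center is weighted by its cluster size. The abstract specifies none of these details, nor the fuzzifier `m = 2`, the fixed iteration count, or the deterministic initialization. SVM training and the second-stage de-clustering are left out.

```python
import math


def _dist(a, b):
    """Euclidean distance between two coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fuzzy_c_means(points, c, m=2.0, iters=50):
    """Plain fuzzy C-means; returns (centers, memberships of the last pass)."""
    n, dim = len(points), len(points[0])
    # Assumption: deterministic start from c evenly spaced samples.
    centers = [list(points[(i * n) // c]) for i in range(c)]
    u = []
    for _ in range(iters):
        # Membership update: u[k][i] is proportional to d(x_k, v_i)^(-2/(m-1)).
        u = []
        for x in points:
            inv = [max(_dist(x, v), 1e-12) ** (-2.0 / (m - 1.0)) for v in centers]
            total = sum(inv)
            u.append([w / total for w in inv])
        # Center update: mean of all samples weighted by u^m.
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            s = sum(w)
            centers[i] = [
                sum(w[k] * points[k][j] for k in range(n)) / s for j in range(dim)
            ]
    return centers, u


def reduce_training_set(points, labels, c):
    """Keep one weighted center per uniform cluster and all mixed-cluster samples.

    Returns a list of (vector, label, weight) triples.
    """
    centers, u = fuzzy_c_means(points, c)
    clusters = [[] for _ in range(c)]
    for k, row in enumerate(u):
        clusters[row.index(max(row))].append(k)  # harden by largest membership
    reduced = []
    for i, members in enumerate(clusters):
        classes = {labels[k] for k in members}
        if len(classes) == 1:
            # Uniform cluster: the center stands in for every member.
            # Assumption: its weight is the number of samples it replaces.
            reduced.append((tuple(centers[i]), classes.pop(), len(members)))
        else:
            # Mixed (or empty) cluster: keep each original sample with weight 1.
            reduced.extend((tuple(points[k]), labels[k], 1) for k in members)
    return reduced
```

The returned weights could then be handed to any SVM trainer that accepts per-sample weights. Which trainer the thesis used is not stated in the abstract.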
Experiments show that the two-stage method largely preserves the classification accuracy of the standard SVM while speeding up training. The improved classification method has a clear advantage on large-scale sample sets.
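
The tag- and position-based weighting from item 2 can be illustrated with a short Python sketch. It is only an illustration under assumptions: the abstract does not give the actual weights or the position rule, so the values in `TAG_WEIGHTS`, the `FIRST_PARAGRAPH_BOOST` factor, and the names `WeightedTermCounter` and `weighted_term_frequencies` are all hypothetical. Whitespace splitting stands in for real preprocessing such as word segmentation and stop-word removal.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights; the abstract does not state the real values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "p": 1.0}
# Assumed position rule: terms in the first paragraph count 1.5 times.
FIRST_PARAGRAPH_BOOST = 1.5


class WeightedTermCounter(HTMLParser):
    """Accumulate term counts scaled by the enclosing tag and by position."""

    def __init__(self):
        super().__init__()
        self.stack = []           # open tags that carry a weight
        self.paragraphs_seen = 0  # number of <p> elements opened so far
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)
            if tag == "p":
                self.paragraphs_seen += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close the innermost open tag with this name.
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx]

    def handle_data(self, data):
        if not self.stack:
            return  # text outside any weighted tag is ignored in this sketch
        weight = max(TAG_WEIGHTS[t] for t in self.stack)
        if "p" in self.stack and self.paragraphs_seen == 1:
            weight *= FIRST_PARAGRAPH_BOOST
        for term in data.lower().split():
            self.scores[term] += weight


def weighted_term_frequencies(html):
    """Return {term: weighted frequency} for one HTML document."""
    counter = WeightedTermCounter()
    counter.feed(html)
    counter.close()
    return dict(counter.scores)
```

The resulting weighted frequencies would replace raw term counts when building the document vector, for example as the term-frequency factor in a TF-IDF scheme.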
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092

Article ID: 2503962

Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2503962.html
