
Research on Network Text Classification Technology

Published: 2018-05-03 23:39

Topics: web page text extraction + Chinese word segmentation. Source: North China University of Technology, master's thesis, 2012


[Abstract]: With the development of network technology, the Internet has become the main source from which people obtain information. The openness of the network, however, fills it with information of every kind. To let people quickly find the information that interests them, using network text classification technology to bring order to this disorganized information has become increasingly important. Network text classification underlies fields such as information filtering and search engines, and has therefore gradually become an active research topic.

This thesis first introduces web text extraction technology and the theory related to text classification: the HTML language, Chinese word segmentation, similarity calculation, weight calculation, feature extraction, and commonly used text classification methods. Building on these basic theories and methods, a network text classification system is designed and implemented.

The main work covers the following aspects. For web text extraction, the extraction of text from web pages is designed and implemented based on an analysis of the characteristics of HTML and the structure of typical web pages. For text classification, the KNN and naive Bayes text classification algorithms are analyzed in detail and used to classify texts by domain. Following the analysis of naive Bayes, and to address its independence assumption, the tree-augmented naive Bayes (TAN) Bayesian network model is adopted to improve the Bayesian classification method; by considering the relationship between pairs of words, it relaxes the independence assumption to some extent. A method for judging the attitude of a text is also proposed: sentiment feature words are extracted from the text, their weights are analyzed, and the text's attitude is evaluated from them, which yields a second level of classification.

Finally, the network text classification system is tested. Experiments on corpus texts show that the system achieves a certain degree of accuracy, and experiments on text extracted from web pages show that it has a certain degree of practicality.
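As an illustration of the naive Bayes text classification named in the abstract, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not the thesis's implementation, which the abstract does not show. The class name `NaiveBayesTextClassifier`, the choice of add-one smoothing, and the toy training data are all assumptions made for illustration, and the English tokens stand in for the output of a Chinese word segmenter.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Hypothetical multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()                  # all tokens seen in training

    def fit(self, documents, labels):
        # Each document is a list of tokens that has already been segmented.
        for tokens, label in zip(documents, labels):
            self.class_doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        # Returns log P(class) + sum of log P(token | class) for each class.
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen during training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy, made-up training data: pre-segmented documents with domain labels.
train_docs = [
    ["stock", "market", "rises"],
    ["bank", "rate", "market"],
    ["team", "match", "goal"],
    ["match", "champion", "team"],
]
train_labels = ["finance", "finance", "sports", "sports"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction = classifier.predict(["team", "champion"])
```

The per-token product in `log_scores` is where the independence assumption enters: each token contributes to the score without regard to its neighbours, which is the limitation the abstract says the TAN model is used to relax.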
[Degree-granting institution]: North China University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1


Article ID: 1840635



Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1840635.html

