Research on Data Mining Based on User Posts on Review Websites

Published: 2018-10-13 16:07

[Abstract]: With the rapid growth of the Internet, many review-oriented websites have emerged that interact closely with their users. Their most prominent characteristics are real-time updates and rapid turnover of information. Because of these characteristics, such websites contain a great deal of valuable latent knowledge, and mining it offers important guidance for social development. This thesis takes BBS sites, the most typical representatives of this class of website, as the research object and uses a search engine to mine their user-generated commentary and extract potentially valuable information. It adopts a new page-ranking algorithm (the P-OPIC algorithm) that improves the mining of web content and lets users locate target pages more quickly. The thesis studies the components and architecture of search engines and analyzes the operating mechanism of the open-source search engine Nutch. The main work is as follows:
(1) The crawler framework and indexing framework of Nutch are studied in detail, and Nutch's execution flow is analyzed in depth. The PageRank, HITS, and OPIC algorithms are studied, and an optimized algorithm based on OPIC is proposed. The optimized algorithm incorporates each page's PageRank value and a BBS-site adjustment factor; the adjustment factor improves the stability of BBS page rankings.
(2) Nutch's data structures are studied, a new data structure is added to Nutch, and Chinese word segmentation is implemented. Data in the Nutch source code is modified to reduce the algorithm's impact on search engine performance.
(3) An experimental method is proposed to evaluate the algorithm, comparing OPIC against the improved OPIC-based algorithm on BBS data. In these tests, the improved algorithm interprets user-entered keywords well, ranks pages considerably better than OPIC, and clearly improves ranking accuracy. The experimental results are analyzed and compared, and the algorithm's strengths and weaknesses are summarized.
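The abstract does not state the P-OPIC formula, so the following is only a minimal sketch of the baseline idea it builds on: in OPIC, each page holds some "cash", passes it along its outlinks when visited, and its importance is estimated from the cash it has accumulated over time. The `blended_score` function is purely illustrative of how a PageRank value and a BBS adjustment factor could enter a score; its form, the `alpha` weight, and every name here are assumptions, not the thesis's method or Nutch's API.

```python
from typing import Dict, List


def opic(graph: Dict[str, List[str]], rounds: int = 200) -> Dict[str, float]:
    """Estimate page importance with a simplified OPIC loop.

    `graph` maps each page to its outlinks; every link target is assumed
    to also appear as a key. Pages are visited round-robin (an assumption;
    other visiting policies are possible).
    """
    pages = list(graph)
    cash = {p: 1.0 / len(pages) for p in pages}   # total cash always sums to 1
    history = {p: 0.0 for p in pages}             # cash each page has received so far

    for _ in range(rounds):
        for page in pages:
            amount = cash[page]
            history[page] += amount               # record the cash before spending it
            cash[page] = 0.0
            targets = graph[page] or pages        # dangling page: share with all pages
            share = amount / len(targets)
            for target in targets:
                cash[target] += share

    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in pages}


def blended_score(opic_score: float, pagerank: float,
                  bbs_factor: float, alpha: float = 0.5) -> float:
    """Hypothetical blend of OPIC and PageRank, scaled by a BBS factor."""
    return bbs_factor * (alpha * opic_score + (1.0 - alpha) * pagerank)


if __name__ == "__main__":
    # Toy BBS: a board index linking to two threads.
    toy = {
        "index": ["thread1", "thread2"],
        "thread1": ["index"],
        "thread2": ["index", "thread1"],
    }
    for page, score in sorted(opic(toy).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

On the toy graph, the board index should rank highest because both threads link back to it. A real deployment would compute such scores incrementally during crawling instead of over a fixed in-memory graph.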
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13; TP391.3

Article ID: 2269206

Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2269206.html
