Research on Data Mining Based on User Posts on Comment-Oriented Websites
[Abstract]: With the rapid development of the Internet, many comment-oriented websites that interact closely with their users have appeared. Their most prominent characteristics are real-time content and rapid turnover of information. Because of these characteristics, such websites hold a great deal of valuable knowledge, and mining it can offer useful guidance for social development. This thesis takes BBS sites, the most typical representatives of this kind of website, as its research object and uses a search engine to mine their comment content for potentially valuable information. A new ranking algorithm, the P-OPIC algorithm, is used to improve web content mining so that users can locate target pages more quickly. The thesis studies the composition and framework of search engines and analyzes the operating mechanism of the open source search engine Nutch. The main work is as follows:
(1) The crawler framework and index framework of Nutch are studied in detail, and the running process of Nutch is analyzed in depth. The PageRank, HITS, and OPIC algorithms are studied, and an optimized algorithm based on OPIC is proposed. The optimized algorithm adds the PageRank value of the web page and an adjustment factor for BBS websites; the adjustment factor improves the stability of BBS page ranking.
(2) The data structures of Nutch are studied, a new data structure is added to Nutch, and Chinese word segmentation is implemented. The Nutch source code is modified so that the algorithm has less impact on the performance of the search engine system.
(3) An experimental method is proposed to study the performance of the algorithm, and the OPIC algorithm is compared with the improved OPIC-based algorithm on experimental data. The algorithm is tested in a BBS data environment.
According to the experiments, the improved algorithm matches the keywords entered by the user well, ranks web pages noticeably better than the OPIC algorithm, and clearly improves ranking accuracy. The experimental results are analyzed and compared, and the advantages and disadvantages of the algorithm are summarized.
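The abstract does not give the P-OPIC formula, so the following is only a minimal Python sketch of the ingredients it names: OPIC-style cash distribution, a PageRank value, and a BBS adjustment factor. The function names, the round-robin visiting order, the handling of pages without out-links, and in particular the weighted combination in `combined_score` are assumptions made for illustration, not the thesis's method and not Nutch's implementation.

```python
from typing import Dict, List

# Adjacency list: page -> pages it links to.
# Assumption: every link target also appears as a key.
Graph = Dict[str, List[str]]


def opic_scores(graph: Graph, rounds: int = 50) -> Dict[str, float]:
    """Estimate page importance with OPIC-style cash distribution."""
    pages = sorted(graph)
    cash = {p: 1.0 / len(pages) for p in pages}
    history = {p: 0.0 for p in pages}
    for _ in range(rounds):
        for page in pages:  # simple round-robin visiting policy
            amount = cash[page]
            history[page] += amount  # record the cash seen at this visit
            cash[page] = 0.0
            # Assumption: pages without out-links spread cash over all pages.
            targets = graph[page] or pages
            share = amount / len(targets)
            for target in targets:
                cash[target] += share
    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in pages}


def pagerank_scores(graph: Graph, damping: float = 0.85,
                    iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank on the same graph."""
    pages = sorted(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph[page] or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def combined_score(opic: float, pagerank: float,
                   bbs_factor: float = 1.0, weight: float = 0.5) -> float:
    """Hypothetical blend of OPIC, PageRank, and a BBS adjustment factor."""
    return bbs_factor * (weight * opic + (1.0 - weight) * pagerank)


if __name__ == "__main__":
    demo = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    opic = opic_scores(demo)
    pr = pagerank_scores(demo)
    for page in sorted(demo):
        print(page, round(combined_score(opic[page], pr[page]), 4))
```

In this sketch `bbs_factor` is a plain multiplier and `weight` balances the two scores; how the thesis actually defines and applies its adjustment factor is not stated in the abstract.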
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13; TP391.3
Article ID: 2269206
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2269206.html