
Research on Topic Web Crawlers for Web Text Mining

Published: 2018-11-25 07:24

[Abstract]: With the arrival of the Web 3.0 era, the number and complexity of Web pages on the Internet are growing explosively, and the information those pages contain is growing geometrically with them. Because a Web page's information is usually carried by its text, Web text data holds a wealth of knowledge and rules that are valuable to users. However, Web text data is semi-structured, real-time, and discrete, so users find it difficult to obtain the knowledge they need directly from such complex data sets. How to effectively mine the information and knowledge users really care about from massive Web text data, and present it in a form they can understand, is therefore a very active research topic. This thesis approaches the problem from two sides, acquiring Web text data and analyzing it, and studies how to obtain the Web text information users need accurately and efficiently and how to mine the valuable knowledge in it. The specific work is as follows. Topic web crawler: the thesis first surveys the principles and architecture of existing topic web crawlers, then introduces their classification and selects the functional topic web crawler as its focus. Finally, it compares implementation languages for web crawlers and chooses the emerging Node.js platform to implement a topic web crawler targeting topic-based online communities. Web text representation model: the thesis first surveys existing text representation models. Then, because the Web text data it deals with consists mainly of short texts, it combines keyword extraction and word-vector representation techniques from natural language processing to propose a text representation model based on keyword vectors. Web text clustering algorithm: the thesis first introduces the definition of Web text mining, then describes in detail the clustering techniques used in Web text mining. After analyzing the categories of Web text clustering algorithms, it selects BIRCH as its clustering algorithm, analyzes BIRCH's shortcomings, and proposes a new Web text clustering algorithm. Building on this work, the thesis combines its results on Web text mining and topic web crawling to design and implement an information acquisition and analysis system for topic-based online communities.
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1;TP393.09
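
The abstract names BIRCH as the starting point for the clustering work but gives no details of the improved algorithm. As background only, the following is a minimal Python sketch of the clustering feature (CF) that standard BIRCH maintains for each cluster: the point count, the linear sum, and the sum of squared norms. The class and function names are hypothetical, and `insert_point` uses a flat list of clusters in place of BIRCH's CF tree, which is a simplifying assumption. It is not the thesis's proposed algorithm.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """BIRCH clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    linear_sum: List[float]
    square_sum: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CF additivity: merging two clusters adds their features component-wise.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root-mean-square distance of the member points from the centroid.
        centroid_norm_sq = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.square_sum / self.n - centroid_norm_sq, 0.0))


def insert_point(clusters: List[ClusteringFeature],
                 point: Sequence[float],
                 threshold: float) -> None:
    """Absorb the point into the nearest cluster if the merged radius stays
    within the threshold; otherwise start a new cluster.

    Simplification (assumption): a flat list is scanned instead of a CF tree.
    """
    candidate = ClusteringFeature.from_point(point)
    if clusters:
        nearest = min(range(len(clusters)),
                      key=lambda i: math.dist(clusters[i].centroid(), point))
        merged = clusters[nearest].merge(candidate)
        if merged.radius() <= threshold:
            clusters[nearest] = merged
            return
    clusters.append(candidate)
```

In the setting the abstract describes, each point would be a text vector. Because a CF can be updated without revisiting earlier points, the data can be clustered in a single pass, which is the property that makes BIRCH attractive for large collections.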



Article ID: 2355288


Article link: http://www.sikaile.net/guanlilunwen/ydhl/2355288.html
