Research on Topical Web Crawlers for Web Text Mining
[Abstract]: With the advent of the Web 3.0 era, the number and complexity of Web pages on the Internet have grown explosively, and the information they contain has grown geometrically. Most of a page's information is carried by its text, so Web text data holds a wealth of knowledge and patterns that are valuable to users. However, because Web text data is semi-structured, real-time, and discrete, users find it difficult to obtain the knowledge they need directly from such a complex data set. How to effectively mine the information and knowledge that users actually care about from massive Web data, and present it in a form they can understand, is therefore an active research topic. This thesis addresses two aspects of the problem, acquiring Web text data and analyzing it, and studies how to obtain the Web text information users need accurately and efficiently and how to mine valuable knowledge from it. The specific work is as follows.

First, on the topical web crawler: the thesis analyzes the principles and architecture of topical web crawlers, introduces their classification, and selects the functional topical web crawler as the focus of the study. It then compares implementation languages for web crawlers and chooses Node.js, a relatively new option, to implement the topical web crawler.

Second, on a Web text representation model for topical online communities: the thesis first surveys existing text representation models. Then, because the Web text data considered here is mainly short text, it draws on keyword extraction and word-vector representation techniques from natural language processing and proposes a text representation model based on keyword vectors.

Third, on the Web text clustering algorithm: the thesis first introduces the definition of Web text mining and then describes clustering in Web text mining in detail. After reviewing the categories of Web text clustering algorithms, it selects BIRCH as the clustering algorithm, analyzes its shortcomings, and proposes a new Web text clustering algorithm.

Building on this work, the thesis designs and implements an information acquisition and analysis system for topical online communities that combines its results on Web text mining and topical web crawling.
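The abstract names BIRCH as the starting point for clustering but does not describe the improved algorithm, so the following is only a minimal sketch of the standard BIRCH building block, the clustering feature CF = (N, LS, SS), and not the thesis's method. The class name `ClusteringFeature`, the function `cluster_flat`, and the sample points are hypothetical, and the flat list of CFs is a deliberate simplification that omits the CF tree used by full BIRCH.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Standard BIRCH summary of a set of vectors: CF = (N, LS, SS)."""
    n: int            # number of vectors summarized
    ls: List[float]   # linear sum of the vectors
    ss: float         # sum of squared norms of the vectors

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(x), sum(v * v for v in x))

    def merged(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: merging two clusters adds the components.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        # Root-mean-square distance of the members from the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))


def cluster_flat(points: Sequence[Sequence[float]],
                 threshold: float) -> List[ClusteringFeature]:
    """Hypothetical flat simplification (no CF tree): absorb each point into
    the nearest CF if the merged radius stays within the threshold,
    otherwise start a new CF."""
    cfs: List[ClusteringFeature] = []
    for x in points:
        point_cf = ClusteringFeature.from_point(x)
        best_index = None
        best_dist = math.inf
        for i, cf in enumerate(cfs):
            d = math.dist(cf.centroid(), x)
            if d < best_dist:
                best_index, best_dist = i, d
        if best_index is not None:
            candidate = cfs[best_index].merged(point_cf)
            if candidate.radius() <= threshold:
                cfs[best_index] = candidate
                continue
        cfs.append(point_cf)
    return cfs
```

Because a CF can be updated by simple addition, each text vector is read once and never stored, which is why BIRCH suits large, incrementally arriving collections such as crawled pages. In practice the vectors would come from a text representation model; the two-dimensional points used to exercise the sketch are purely illustrative.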
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1; TP393.09
Article ID: 2355288
Article link: http://www.sikaile.net/guanlilunwen/ydhl/2355288.html