Design and Implementation of a Meta-Search Data Collection System for Quality and Safety

Published: 2019-05-11 07:53

[Abstract]: Quality and safety incidents occur frequently, and as Internet use has spread they are increasingly discussed online. Comments that people post about quality and safety, together with online media coverage of the topic, can serve as a text corpus for quality and safety analysis. The Internet can therefore act as a data source for quality and safety information and provide the data foundation for such analysis. This thesis designs and implements a data collection system based on meta-search that gathers web pages related to quality and safety. Here the meta-search engine is not used in the traditional way; instead, it collects data according to query terms set by the user. The system has three functional modules: meta-search querying, web page extraction, and relevance determination. The meta-search module wraps the different meta-search engines and manages queries with priority scheduling. The web page extraction module uses two approaches: template-based parsing, which mainly extracts result links, and statistics-based parsing, which serves as a general-purpose method for extracting body text. The relevance determination module uses a support vector machine (SVM) classifier to select data related to quality and safety and to remove noise. Finally, the thesis tests the extraction and classification performance and presents the results of running the system. Because quality and safety data is scattered across the Internet and has distinctive features, the thesis does not use a focused crawler and instead tries using meta-search engines for data collection. The work may serve as a reference for data collection research in other fields.
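
The abstract names priority-scheduled queries and statistics-based body-text extraction but gives no implementation details. The sketch below is only one plausible reading, not the thesis's own method. The names `QueryTask`, `QueryScheduler`, and `extract_main_text`, the engine labels, and the `min_chars` threshold are all hypothetical. The extraction heuristic (keep the longest run of text-dense lines) is a simple stand-in for whatever statistical method the thesis actually uses.

```python
import heapq
import itertools
import re
from dataclasses import dataclass, field


@dataclass(order=True)
class QueryTask:
    priority: int                       # lower value is served first
    seq: int                            # tie-breaker: keeps submission order
    query: str = field(compare=False)
    engine: str = field(compare=False)  # hypothetical engine label


class QueryScheduler:
    """Hands out pending queries in priority order (assumed policy)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, query, engine, priority):
        task = QueryTask(priority, next(self._counter), query, engine)
        heapq.heappush(self._heap, task)

    def next_task(self):
        # Returns None when no queries are waiting.
        return heapq.heappop(self._heap) if self._heap else None


def extract_main_text(html, min_chars=20):
    """Return the contiguous run of long text lines with the most characters.

    Assumption: body text forms consecutive long lines, while navigation
    and footer text forms short ones.
    """
    # Drop script and style elements together with their contents.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    # Treat common block-level tags as line breaks, then strip other tags.
    html = re.sub(r"(?i)</?(p|div|br|li|h[1-6]|tr)\b[^>]*>", "\n", html)
    text = re.sub(r"(?s)<[^>]+>", "", html)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    best, cur = [], []
    for ln in lines + [""]:             # empty sentinel flushes the last run
        if len(ln) >= min_chars:
            cur.append(ln)
        else:
            if sum(map(len, cur)) > sum(map(len, best)):
                best = cur
            cur = []
    return "\n".join(best)


if __name__ == "__main__":
    scheduler = QueryScheduler()
    scheduler.submit("milk powder recall", "engine_a", priority=2)
    scheduler.submit("toy safety", "engine_b", priority=1)
    task = scheduler.next_task()
    print(task.query, task.engine)
```

In a full pipeline, the text returned by `extract_main_text` would then go to the relevance classifier. That step is left out here because the abstract does not describe the features or training data used.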
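[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP274.2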

Article No.: 2474318

Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2474318.html
