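Research on Spatio-temporal Element-Driven Retrieval of Event Web Page Information
Published: 2018-01-21 23:33
Keywords: web page text; event; spatio-temporal elements; retrieval; "time-space-topic" index. Source: Nanjing Normal University, master's thesis, 2013. Document type: degree thesis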
【Abstract】: Supported by the national "863" project "Research on Ubiquitous Spatial Information Association and Updating and Topic-Oriented Spatio-temporal Information Mining", this thesis explores methods for event-oriented web page text acquisition and retrieval services. These methods provide data support for the structured representation of multi-source web information, the reconstruction of spatio-temporal event sequences, visualization, and mining analysis. The work follows the technical pipeline of "data acquisition, organization and management, retrieval service" for event web page text. Based on an analysis of how event information is described linguistically and organized in Chinese web page text, and taking natural disaster events as an example, it studies the key technologies of an event web page information retrieval engine driven by spatio-temporal elements. The main research content and conclusions are as follows:
(1) Event web page acquisition driven by spatio-temporal elements: By analyzing the content and features of web page text that describes events, an event expression template is constructed with time, spatial location, and event topic as its basic elements. A web crawler is then customized according to the content of this template to fetch web page text describing events. Experiments show that, compared with a traditional crawler, the event-topic crawler built on the event expression template filters web pages well and retrieves pages with higher precision. However, the large amount of computation introduced into the topic crawler lowers its performance somewhat.
(2) Distributed "time-space-topic" indexing and storage of event web pages: Rule-based models and a conditional random field (CRF) model are used to extract event-related time, spatial location, and topic information from web page text, and a method for classifying the events in web page text based on a support vector machine (SVM) model is proposed. A distributed index based on "time-space-topic" is built to address low retrieval efficiency. Distributed storage of massive web page text is implemented on the HBase database and the HDFS file system.
(3) "Text-map" interactive event web page information retrieval service: By summarizing the descriptive characteristics of user queries, automatic parsing of event information queries is implemented. Drawing on the lexical organization of Tongyici Cilin (a Chinese synonym thesaurus), a lexical knowledge base for the natural disaster event domain and a similarity retrieval model are constructed, which enable similarity calculation and ranking between candidate web page texts and the query conditions.
(4) Prototype system design and implementation: Based on the event web page acquisition method, the distributed indexing and storage method, and the retrieval service method proposed in this thesis, a prototype system is designed using the Google Map API. The architecture and main functional modules of the prototype system are discussed.
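The abstract does not give the layout of the "time-space-topic" index, so the following is only a minimal sketch of the general idea: pages are filed under a composite key of date, place name, and topic, and a query scans a date range for one place and one topic. The class name `TimeSpaceTopicIndex`, the key layout, and the sample pages are all assumptions made for illustration. The thesis itself builds a distributed index on HBase and HDFS, which this single-process example does not attempt to reproduce.

```python
from collections import defaultdict
from datetime import date, timedelta


class TimeSpaceTopicIndex:
    """Toy in-memory index; the (date, place, topic) key layout is an assumption."""

    def __init__(self):
        # (ISO date string, place name, topic) -> set of page ids
        self._postings = defaultdict(set)

    def add(self, page_id, day, place, topic):
        """File one page under its extracted time, place, and topic."""
        self._postings[(day.isoformat(), place, topic)].add(page_id)

    def search(self, start, end, place, topic):
        """Return sorted page ids for one place and topic with a date in [start, end]."""
        hits = set()
        day = start
        while day <= end:
            # .get() avoids creating empty entries in the defaultdict
            hits |= self._postings.get((day.isoformat(), place, topic), set())
            day += timedelta(days=1)
        return sorted(hits)


# Made-up sample pages, not data from the thesis.
index = TimeSpaceTopicIndex()
index.add("page-1", date(2013, 1, 1), "CityA", "earthquake")
index.add("page-2", date(2013, 1, 2), "CityA", "earthquake")
index.add("page-3", date(2013, 1, 2), "CityB", "flood")

result = index.search(date(2013, 1, 1), date(2013, 1, 3), "CityA", "earthquake")
print(result)
```

Keying on all three elements lets a query discard pages from other places and topics without reading their text, which is one plausible reading of how such an index could address the low retrieval efficiency the abstract mentions.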
【Degree-granting institution】: Nanjing Normal University
【Degree level】: Master's
【Year degree granted】: 2013
【Classification number】: TP391.1; P208
Article ID: 1452880
Article link: http://www.sikaile.net/kejilunwen/dizhicehuilunwen/1452880.html