Identification and Extraction of Specific Entity Relations and the Design and Implementation of the System

Published: 2019-01-27 15:03

[Abstract]: With advances in Internet technology, the Internet has become an indispensable part of people's work and daily life. Its greatest advantage is the vast amount of information it makes available to users, but that same volume makes information hard to find. Search engines give users a simple, fast way to search: by submitting keywords, a user can retrieve related content from this mass of information and obtain links to the pages that contain it. Even so, the precision of search results often fails to satisfy users. When a user is looking for specific information in a specific domain, and for the relationships between those pieces of information, the search engine results usually still have to be sifted and analyzed by hand.

Based on a study of users' day-to-day work, this thesis investigates the extraction of specific entities of interest to users and the extraction of relations between those entities. For fixed-format web pages, it analyzes how information is distributed on the page, treats the page source directly as a character stream, and extracts specific entity information by regular expression matching. In addition, based on an analysis of user requirements, a search keyword constructor is designed and implemented: it combines configurable basic keywords with special keywords and submits the resulting queries to a search engine, so as to obtain more comprehensive search results from non-fixed-format web pages. For the identification and extraction of specific entity relations, HTMLParser is used to process pages, extracting the result URLs returned by a general-purpose search engine and the text of the pages those URLs point to.
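The abstract does not give the constructor's actual design, so the following is only a minimal sketch of the idea of combining configurable basic keywords with special keywords. The class name `KeywordQueryBuilder`, the method `buildQueries`, and the choice to pair every basic keyword with every special keyword are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a search keyword constructor; all names are illustrative. */
final class KeywordQueryBuilder {

    /**
     * Pairs every basic keyword with every special keyword and returns one
     * query string per pair. Null or blank entries are skipped.
     */
    static List<String> buildQueries(List<String> basicKeywords, List<String> specialKeywords) {
        List<String> queries = new ArrayList<>();
        for (String basic : basicKeywords) {
            if (basic == null || basic.trim().isEmpty()) {
                continue;
            }
            for (String special : specialKeywords) {
                if (special == null || special.trim().isEmpty()) {
                    continue;
                }
                queries.add(basic.trim() + " " + special.trim());
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Example configuration (made up): one basic keyword, two special keywords.
        List<String> queries = buildQueries(
                Arrays.asList("computer science faculty"),
                Arrays.asList("email", "contact"));
        for (String query : queries) {
            System.out.println(query);
        }
    }
}
```

Each returned string would then be submitted to the search engine as a separate request. How the real system encodes and sends those requests is not described in the abstract.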
The Chinese Academy of Sciences word segmentation system is used for Chinese word segmentation and part-of-speech tagging, from which person-name entities in the page text are extracted. Regular expressions are used to extract e-mail address entities from the text. Finally, relations between specific entities are extracted by a set of predefined rules that exploit the association between the pinyin combinations of a Chinese name and the prefix of an e-mail address.

The thesis also designs and implements a usable information extraction system with a B/S (browser/server) architecture. The system is developed in Java and comprises three main modules: a user interface module, a specific entity extraction module, and a specific entity relation extraction module. Through the interface module, users can invoke the functions of the other two modules to extract information automatically. Compared with users' traditional manual collection and analysis, the system substantially reduces manual labor, shortens the time needed to collect and analyze information, saves labor and material costs, and improves work efficiency. It is also quick to deploy and simple to maintain, and has been well received by users.
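The abstract names the two ingredients of the relation step, regular-expression e-mail extraction and rules linking name pinyin to e-mail prefixes, but not the expressions or rules themselves. The sketch below illustrates one way this could work under stated assumptions: the e-mail pattern is deliberately simplified, the pinyin syllables of the name are assumed to be already available (surname first), and the five candidate prefix forms are illustrative guesses, not the thesis's rule set. The class `NameEmailMatcher` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: link person names to e-mail addresses via pinyin prefixes. */
final class NameEmailMatcher {

    // Simplified e-mail pattern (an assumption, not the expression used in the thesis).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    /** Returns every e-mail address found in the text, in order of appearance. */
    static List<String> extractEmails(String text) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        return emails;
    }

    /**
     * Builds candidate e-mail prefixes from a name's pinyin syllables,
     * surname first, e.g. ["li", "xiao", "ming"]. Syllables must be non-empty.
     */
    static Set<String> candidatePrefixes(List<String> syllables) {
        if (syllables.size() < 2) {
            throw new IllegalArgumentException("need a surname and at least one given-name syllable");
        }
        String surname = syllables.get(0).toLowerCase(Locale.ROOT);
        StringBuilder givenBuilder = new StringBuilder();
        StringBuilder initialsBuilder = new StringBuilder();
        for (String syllable : syllables.subList(1, syllables.size())) {
            String lower = syllable.toLowerCase(Locale.ROOT);
            givenBuilder.append(lower);
            initialsBuilder.append(lower.charAt(0));
        }
        String given = givenBuilder.toString();
        String initials = initialsBuilder.toString();

        Set<String> candidates = new LinkedHashSet<>();
        candidates.add(surname + given);                // lixiaoming
        candidates.add(given + surname);                // xiaomingli
        candidates.add(surname + initials);             // lixm
        candidates.add(initials + surname);             // xmli
        candidates.add(surname.charAt(0) + initials);   // lxm
        return candidates;
    }

    /** True if the e-mail prefix, minus separators and trailing digits, is a candidate. */
    static boolean isRelated(List<String> syllables, String email) {
        int at = email.indexOf('@');
        if (at < 0) {
            return false;
        }
        String prefix = email.substring(0, at).toLowerCase(Locale.ROOT);
        String normalized = prefix.replaceAll("[._\\-]", "").replaceAll("\\d+$", "");
        return candidatePrefixes(syllables).contains(normalized);
    }

    public static void main(String[] args) {
        // Made-up page text and name, for illustration only.
        String text = "Contact Li Xiaoming at li.xm@example.com or admin@example.org.";
        List<String> name = Arrays.asList("li", "xiao", "ming");
        for (String email : extractEmails(text)) {
            System.out.println(email + " related: " + isRelated(name, email));
        }
    }
}
```

Obtaining the pinyin syllables from the segmented person names is left out, since the abstract does not say how the conversion is done.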
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3

Article number: 2416378

Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2416378.html
