Identification and Extraction of Specific Entity Relations, and the Design and Implementation of a Supporting System
[Abstract]: As Internet technology has advanced, the Internet has become an indispensable part of people's work and life. Its greatest advantage is the huge amount of information it makes available, but that volume also makes information hard to find. Search engines give users a simple and fast way to search: by submitting keywords, a user can retrieve related content from a large body of information, together with the link addresses of the pages that contain it. Even so, the accuracy of search results often fails to satisfy users, especially when they are looking for specific information in a particular domain and for the relationships within that information. Search engine results usually still have to be filtered and analyzed by hand. Based on an investigation of users' daily work, this paper studies the extraction of specific entities of interest to users and of the relations between those entities. For the extraction of specific entities, it analyzes how information is distributed in fixed-format web pages, treats the page source file directly as a character stream, and uses regular expression matching to extract the entity information. A search keyword constructor is also designed and implemented from an analysis of user requirements. Through configurable combinations of basic keywords and special keywords, it submits different search requests to search engines in order to obtain more comprehensive results from web pages that have no fixed format. For the identification and extraction of specific entity relations, HTMLParser is used to process pages and extract the URLs returned by a general-purpose search engine along with the text of the pages those URLs point to.
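The keyword constructor described above can be illustrated with a minimal sketch. The abstract does not give the actual design, so everything here is an assumption: the class name `KeywordConstructor`, the method `buildQueries`, the space-joined query format, and the choice to also submit each basic keyword on its own are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the names and the combination rule are illustrative
// assumptions, not the design used in the thesis.
class KeywordConstructor {

    // Pair every basic keyword with every special keyword to form one query each.
    static List<String> buildQueries(List<String> basicKeywords, List<String> specialKeywords) {
        List<String> queries = new ArrayList<>();
        for (String basic : basicKeywords) {
            // Assumption: the basic keyword alone is also submitted as the broadest query.
            queries.add(basic);
            for (String special : specialKeywords) {
                queries.add(basic + " " + special);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> queries = buildQueries(
                List.of("South China University of Technology"),
                List.of("professor", "email"));
        // Each query string would be submitted to a search engine as a separate request.
        queries.forEach(System.out::println);
    }
}
```

Keeping the two keyword lists as plain inputs is what makes the combination configurable in this sketch: changing the search coverage means editing the lists, not the code.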
The word segmentation system of the Chinese Academy of Sciences handles Chinese word segmentation and part-of-speech tagging, and regular expressions extract e-mail entities from the text. Finally, relations between specific entities are extracted with a set of hand-written rules that exploit how the pinyin of a Chinese name is typically combined to form the prefix of a mailbox address. This paper also designs and implements a working information extraction system with a browser/server (B/S) architecture. The system is developed in Java and consists of three main modules: a user interface module, a specific entity extraction module, and a specific entity relation extraction module. Through the interface module, the user calls the other two modules to extract information automatically. Compared with traditional manual collection and analysis, the system greatly reduces users' manual work, shortens the time needed to collect and analyze information, saves labor and material costs, and improves working efficiency. It is also quick to deploy and simple to maintain, and it has been well received by its users.
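The e-mail extraction and name-to-mailbox matching steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the rules from the thesis: the e-mail pattern is a simplified one, the pinyin syllables of the name are assumed to be already available (conversion from Chinese characters is not shown), and the class `EmailNameMatcher`, its methods, and the five prefix patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and matching rules are illustrative assumptions.
class EmailNameMatcher {

    // Simplified e-mail pattern (an assumption; it does not cover every valid address).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Collect every substring of the text that matches the e-mail pattern.
    static List<String> extractEmails(String text) {
        List<String> emails = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    // Build candidate mailbox prefixes from the pinyin of a name.
    // Each given-name syllable is assumed to be a non-empty string.
    static List<String> candidatePrefixes(String surname, String... given) {
        String s = surname.toLowerCase(Locale.ROOT);
        StringBuilder full = new StringBuilder();
        StringBuilder initials = new StringBuilder();
        for (String syllable : given) {
            String lower = syllable.toLowerCase(Locale.ROOT);
            full.append(lower);
            initials.append(lower.charAt(0));
        }
        return List.of(
                s + full,                            // e.g. zhangxiaoming
                full + s,                            // e.g. xiaomingzhang
                s + initials,                        // e.g. zhangxm
                initials + s,                        // e.g. xmzhang
                s.charAt(0) + initials.toString());  // e.g. zxm
    }

    // True if the mailbox prefix, with digits and separators removed,
    // equals one of the candidate prefixes for the name.
    static boolean isLikelyOwner(String email, String surname, String... given) {
        String prefix = email.substring(0, email.indexOf('@')).toLowerCase(Locale.ROOT);
        String normalized = prefix.replaceAll("[^a-z]", "");
        return candidatePrefixes(surname, given).contains(normalized);
    }

    public static void main(String[] args) {
        String text = "Contact: Zhang Xiaoming (zhangxm@example.edu.cn), office: admin@example.com";
        for (String email : extractEmails(text)) {
            System.out.println(email + " -> " + isLikelyOwner(email, "zhang", "xiao", "ming"));
        }
    }
}
```

Exact matching against a short list of candidate prefixes keeps the rule set easy to inspect and extend. A real system would probably need more patterns, plus a policy for cases where several names fit the same address.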
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
Article ID: 2416378
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2416378.html