Entity Summarization Based on Web Text and Knowledge Graphs
Topics: entity summarization + word vectors; Source: East China Normal University, 2016 doctoral dissertation
[Abstract]: With the deep integration of the Internet of Things, the Internet, and cloud computing, the volume of semi-structured and unstructured Web data has exploded. Users retrieving information easily get lost in massive, heterogeneous, fragmented data, and helping them quickly and accurately locate the Web entities or knowledge they care about has become a pressing problem. On the one hand, traditional information retrieval systems aim to return large volumes of Web text relevant to a query, but cannot summarize the semantics of that text. On the other hand, to capture the semantic information in unstructured text, many knowledge graphs now integrate entities, attributes, and relations on the order of hundreds of millions. Faced with information this large, heterogeneous, and fragmented, helping users navigate knowledge remains a challenge. This dissertation therefore studies entity summarization over text and knowledge graphs as a response to information overload and user disorientation. To handle the dynamic nature of massive Web text, it first proposes a text-based event entity summarization algorithm. Second, to meet users' personalized needs, it designs a context-aware entity summarization method for knowledge graphs. Finally, to handle the heterogeneity and incompleteness of fragmented information, it proposes a cross-knowledge-graph entity summarization algorithm. The main contributions are as follows:
· A text-based event entity summarization algorithm for massive, dynamic text data. In the Web 2.0 era, descriptions of the same event are scattered in fragments across different Web data sources, and the fragmentation is even more severe across the different stages of an event's development. The dissertation mines these events with a topic clustering model. For each event, it models event summarization as a set cover problem and designs and implements a greedy algorithm for this NP-hard problem in order to generate the event summary.
· A context-aware entity summarization algorithm on knowledge graphs that addresses users' need for intelligent assistance. To counter knowledge overload and disorientation on knowledge graphs, the dissertation applies a topic model to a user's query history to derive user preferences, and on that basis designs a context-aware entity summarization algorithm based on a Markov model to support intelligent knowledge navigation.
· A cross-knowledge-graph entity summarization algorithm that addresses the heterogeneity and incompleteness of knowledge graphs. Descriptions of an entity in different knowledge graphs can both complement and corroborate one another, helping users obtain more accurate query results. The dissertation uses word vectors to match and fuse entities across knowledge graphs, and answers users' entity summarization requests on top of the fused result. The proposed algorithm integrates multiple knowledge graphs and improves both the knowledge coverage and the quality of the resulting summaries.
· An entity summarization demonstration system designed for fragmented data. Building on the three entity summarization algorithms above, together with other text mining and natural language processing tools, the dissertation constructs EntitySummarizer, a distributed, four-tier Web demonstration system centered on entity summarization. It analyzes a user's query, identifies the Web entities the user is interested in, and generates several kinds of entity summaries using the proposed techniques. It also supports text analysis of the generated summaries, such as summary keyword generation and event timeline generation.
The proposed entity summarization methods alleviate the information overload and knowledge disorientation caused by information fragmentation, and the demonstration system provides data preparation and a working reference for studying users' diverse entity summarization needs.
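The abstract states that event summarization is modeled as a set cover problem and solved greedily, but it does not give the exact formulation. The following is a minimal sketch of the standard greedy heuristic under assumed inputs: each candidate sentence is mapped to the set of event "aspects" it mentions, and sentences are picked until every aspect is covered or a length budget is reached. The function name, the aspect labels, and the budget parameter are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List, Set


def greedy_event_summary(
    sentence_aspects: Dict[str, Set[str]], max_sentences: int
) -> List[str]:
    """Greedy set cover: repeatedly pick the sentence that covers the most
    still-uncovered aspects (hypothetical formulation, for illustration)."""
    # The universe to cover is every aspect mentioned by any sentence.
    uncovered: Set[str] = set().union(*sentence_aspects.values())
    summary: List[str] = []
    while uncovered and len(summary) < max_sentences:
        # Candidate with the largest marginal gain; ties go to the
        # sentence that appears first in the input dictionary.
        best = max(
            (s for s in sentence_aspects if s not in summary),
            key=lambda s: len(sentence_aspects[s] & uncovered),
            default=None,
        )
        if best is None or not (sentence_aspects[best] & uncovered):
            break  # no remaining sentence adds new coverage
        summary.append(best)
        uncovered -= sentence_aspects[best]
    return summary


# Toy data: sentence ids mapped to the aspects they mention (assumed labels).
sentences = {
    "s1": {"earthquake", "location"},
    "s2": {"earthquake", "casualties", "rescue"},
    "s3": {"location", "aftershock"},
    "s4": {"rescue"},
}
print(greedy_event_summary(sentences, max_sentences=3))  # ['s2', 's3']
```

Greedy selection is the usual choice for set cover because the exact problem is NP-hard while the greedy heuristic carries a known logarithmic approximation guarantee. Whether the dissertation adds weights, time stamps, or redundancy penalties to this basic scheme is not stated in the abstract.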
[Degree-granting institution]: East China Normal University
[Degree level]: Doctoral
[Year degree awarded]: 2016
[Classification number]: TP391.1
[Citation]: Yan Jihong (閆季鴻); Entity Summarization Based on Web Text and Knowledge Graphs [D]; East China Normal University; 2016
Article number: 2078624
Article link: http://www.sikaile.net/shoufeilunwen/xxkjbs/2078624.html