Research on Distributed Query Techniques for Large-Scale Knowledge Graphs
Published: 2019-06-28 16:46
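【Abstract】: With the arrival of the big data era, the volume of collected data has reached the zettabyte (ZB) scale. To query data precisely, more and more search engines use knowledge graphs as their underlying data support. A knowledge graph is a relational network that describes real-world entities such as places, people, cities, and movies, together with the relationships among them. Using a knowledge graph, a search engine can mine the intrinsic connections between entities and find the information users need more accurately. At present, the data in knowledge graphs is mainly collected automatically from knowledge encyclopedias such as Wikipedia and contains a large amount of unverified information. As a result, knowledge graphs are characterized by abundant noisy data and large data scale, and these characteristics make it difficult for users to obtain satisfactory query results quickly. Given these characteristics, how to achieve fast and efficient knowledge graph querying is a pressing problem for both academia and industry. Existing work usually models knowledge graph querying as a subgraph matching problem and has made some progress, but many shortcomings remain. First, most existing query models require query results to match the user's query exactly; because knowledge graphs contain noisy data, these models miss results that users are interested in and therefore suffer from poor usability. Second, to speed up queries, existing query algorithms generally rely on graph indexing, but because knowledge graphs are large, building a graph index for them incurs high time and space costs. Finally, because knowledge graphs are so large, queries must be executed in a distributed manner, yet existing distributed graph data processing platforms are not optimized for the execution of knowledge graph queries and suffer from low execution efficiency. New knowledge graph query models, algorithms, and computing platforms are therefore needed to meet these challenges. Targeting the two characteristics of knowledge graphs, abundant noisy data and large data scale, this thesis studies the knowledge graph query problem at three levels: the query model, the distributed query algorithm, and the optimization of distributed query execution, with the aim of providing a new, fast, and efficient distributed query technology. First, a query model for knowledge graphs is proposed that masks noisy data based on the idea of fuzzy matching, so that satisfactory query results are always returned. Second, based on the proposed query model, an index-free distributed query algorithm is designed; it reduces query time with a new bounding technique and uses the computing power of the distributed environment to accelerate queries, so that query requests are answered quickly. Third, on a distributed graph data processing platform, the execution efficiency of distributed knowledge graph queries is optimized from two aspects, job scheduling and data storage, which reduces data I/O overhead and further shortens the overall query completion time. Building on this theoretical work, a prototype search engine system for large-scale knowledge graphs is designed and implemented, and a query application over an academic-literature knowledge graph is deployed to verify the effectiveness of the theoretical results. In summary, this thesis addresses the two characteristics of knowledge graphs and proposes fast and efficient distributed query techniques that let users obtain satisfactory query results quickly, offering a practical solution for next-generation search engines. As knowledge graphs become more widespread, the results of this research are expected to be applied in business, finance, life sciences, and many other fields, providing effective data query support for applications such as business decision-making, financial analysis, and biopharmaceuticals, and thus have significant social value.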
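To make the fuzzy-matching and bounding ideas in the abstract concrete, here is a minimal single-machine sketch. It is not the thesis's model or algorithm, which the abstract does not specify. It assumes that a query is a small labeled graph, that "fuzzy" means approximate string similarity between node labels (computed here with Python's `difflib`), that a match is scored by the sum of label similarities, and that a simple upper bound (each unmatched query node adds at most 1.0) is used to prune the search for the top-k matches. The function names, the threshold, and the toy data are all hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a, b):
    """Label similarity in [0, 1]; an assumed stand-in for the thesis's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_top_k(data_labels, data_edges, query_labels, query_edges, k=1, threshold=0.6):
    """Return up to k (score, mapping) pairs, best first (hypothetical helper)."""
    if k <= 0:
        return []
    adj = set(data_edges) | {(v, u) for u, v in data_edges}      # undirected data edges
    q_adj = set(query_edges) | {(v, u) for u, v in query_edges}  # undirected query edges
    q_nodes = list(query_labels)
    best = []   # min-heap of (score, tie_breaker, mapping); best[0] is the weakest kept match
    found = 0   # count of complete matches, used as a unique tie-breaker

    def extend(i, mapping, score):
        nonlocal found
        remaining = len(q_nodes) - i
        # Upper bound: every unmatched query node can add at most 1.0 to the score.
        if len(best) == k and score + remaining <= best[0][0]:
            return
        if remaining == 0:
            found += 1
            item = (score, found, dict(mapping))
            if len(best) < k:
                heapq.heappush(best, item)
            else:
                heapq.heappushpop(best, item)
            return
        q = q_nodes[i]
        for d, label in data_labels.items():
            if d in mapping.values():
                continue  # keep the mapping injective
            s = similarity(query_labels[q], label)
            if s < threshold:
                continue  # labels too different to count as a fuzzy match
            # Every query edge between q and an already-matched node must exist in the data.
            if all((mapping[p], d) in adj for p in mapping if (p, q) in q_adj):
                mapping[q] = d
                extend(i + 1, mapping, score + s)
                del mapping[q]

    extend(0, {}, 0.0)
    ranked = sorted(best, key=lambda t: t[:2], reverse=True)
    return [(score, m) for score, _, m in ranked]


# Toy knowledge graph (illustrative data only).
data_labels = {1: "Christopher Nolan", 2: "Inception", 3: "Leonardo DiCaprio",
               4: "Interstellar", 5: "Titanic"}
data_edges = [(1, 2), (3, 2), (1, 4), (3, 5)]

# Query with misspelled labels: exact matching would return nothing.
query_labels = {"a": "Cristopher Nolan", "b": "Inceptoin"}
query_edges = [("a", "b")]

results = fuzzy_top_k(data_labels, data_edges, query_labels, query_edges, k=1)
print(results)
```

With the default threshold, the misspelled query maps `a` to node 1 and `b` to node 2, whereas `threshold=1.0` (exact label matching) returns an empty list. In a distributed setting the candidate enumeration would be partitioned across workers, which this sketch does not attempt.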
[Abstract]: With the advent of the big data era, the amount of data collected has reached the ZB level. To query data accurately, more and more search engines use knowledge graphs as their underlying data support. A knowledge graph is a relational network that describes real-world entities such as places, people, cities, and movies, and the relationships between them. By using a knowledge graph, a search engine can mine the internal relationships between entities and find the information users need more accurately. At present, the data in knowledge graphs is mainly collected automatically from Wikipedia and other knowledge encyclopedias and contains a lot of unverified information. Knowledge graphs are therefore characterized by abundant noisy data and large data scale, which makes it difficult for users to obtain satisfactory query results quickly. In view of these characteristics, how to realize fast and efficient knowledge graph querying is an urgent problem in both academia and industry. Knowledge graph querying is usually modeled as a subgraph matching problem, and some progress has been made, but many shortcomings remain. First, most existing query models require query results to match the user query exactly, but because of the noisy data in knowledge graphs, these models miss results that users are interested in and suffer from poor usability. Second, to speed up queries, existing query algorithms generally use graph indexing, but because knowledge graphs are large, building a graph index for them carries a high time and space cost. Finally, because of the large scale of knowledge graphs, the query process must be carried out in a distributed way. However, existing distributed graph data processing platforms are not optimized for the execution of knowledge graph queries and suffer from low execution efficiency.
Therefore, new knowledge graph query models, algorithms, and computing platforms are needed to meet these challenges. Targeting the noisy data and large data scale of knowledge graphs, this thesis studies the knowledge graph query problem from three aspects: the query model, the distributed query algorithm, and distributed query execution optimization, in order to provide a new, fast, and efficient distributed query technology. First, a query model for knowledge graphs is proposed that masks noisy data based on fuzzy matching and always guarantees satisfactory query results. Second, based on the proposed query model, an index-free distributed query algorithm is designed. Query time is reduced by a new bounding technique, and queries are accelerated by the computing power of the distributed environment, so that query requests are answered quickly. Third, on a distributed graph data processing platform, the execution efficiency of distributed knowledge graph queries is optimized from two aspects, job scheduling and data storage, to reduce the overhead of data I/O ...
Article ID: 2507457
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2507457.html