Research on Related Entity Discovery on the Web

Published: 2018-05-02 11:24

Topic: related entity discovery + type refinement; Reference: Beijing Jiaotong University, 2013 doctoral dissertation

[Abstract]: With the rapid development of the Internet and information retrieval technology, the Web has become an important channel through which people obtain information, and search engines have become an important tool for retrieving it. In traditional search, a user submits a query to a search engine (such as Google or Baidu), and the engine returns a list of relevant documents. Often, however, what the user needs is not the documents themselves but the entity information they contain. How to find the entities a user needs among the vast number of Web documents has therefore become a research hotspot in recent years, and related entity discovery arose to serve exactly this kind of entity-oriented query. Related entity discovery takes a query consisting of a source entity, a target type, and a description of the relation between the source entity and the target entities, and returns a set of entities that satisfy it.
The returned entities must match the type required by the query, but the given target type is often very coarse, which makes it impossible to judge the type of a retrieved entity accurately. To address this problem, we did the following work:
1) A method is proposed for automatically obtaining the fine-grained target type and its hyponym seed entities. The fine-grained target type is obtained by syntactic analysis of the query sentence, and the hyponym seed entities of the target type are obtained using query templates.
2) An induction-based method is proposed for obtaining the set of hyponym-category discrimination rules of the fine-grained target type. When there are few seed entities, the rule set is obtained by induction.
3) A feature-extraction-based method is proposed for obtaining the set of hyponym-category discrimination rules of the fine-grained target type. When there are many seed entities, the rule set is obtained using the best feature extraction method found through learning.
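The template step in item 1) can be illustrated with a minimal sketch. The abstract does not say which query templates are used, so the "<type> such as A, B and C" pattern below is an assumption, as are the function name `extract_seed_entities` and the made-up snippets; in practice the snippets would come from search results for the template.

```python
import re

def extract_seed_entities(target_type: str, snippets: list[str]) -> list[str]:
    """Collect names that follow '<target_type> such as' in text snippets.

    Hypothetical helper: the template is an assumed example, not
    necessarily one used in the thesis.
    """
    pattern = re.compile(
        re.escape(target_type) + r"\s+such as\s+([^.;:]+)", re.IGNORECASE
    )
    seeds: list[str] = []
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            # Split an enumeration like "A, B and C" into single names.
            parts = re.split(
                r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+", match.group(1)
            )
            for name in (p.strip() for p in parts):
                if name and name not in seeds:
                    seeds.append(name)
    return seeds

# Made-up snippets for illustration only.
snippets = [
    "Star Alliance includes airlines such as Lufthansa, United and Air Canada.",
    "Low-cost airlines such as Ryanair or easyJet.",
]
print(extract_seed_entities("airlines", snippets))
# → ['Lufthansa', 'United', 'Air Canada', 'Ryanair', 'easyJet']
```

Splitting on "and" is a simplification: a name that itself contains "and" would be broken apart, so a real system would need a more careful enumeration parser.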
Because the initially retrieved candidate entities are unordered, all candidates must be ranked in order to obtain the entities that satisfy the user's query. To address this problem, we did the following work:
1) An entity ranking method based on a generative probabilistic model is proposed. Entities are ranked by combining three components: entity relevance, entity type relevance, and entity relation relevance; the best ranking method is identified by comparing several ways of combining them. Entity type relevance is computed in two ways. The first uses the hyponym-category discrimination rule sets obtained by induction and computes type relevance with different numbers of rule sets; the second uses the rule sets obtained by feature extraction. For entity relation relevance, the effect of two smoothing methods on entity ranking is evaluated, and a computation method that reconstructs the relation after removing stop words is proposed, which improves ranking quality and reduces running time.
2) An entity ranking method based on Markov random fields is proposed. The method represents an entity by three attributes (document, type, and name) and ranks entities by linearly combining, with learned optimal weight parameters, the relevance between the query and the candidate entity's representation document, the relevance between the target type and the candidate entity's type, and the relevance between the source entity and the candidate entity's name.
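The generative ranking in item 1) can be sketched as follows. This is one possible combination (a product of the three components, computed in log space), not necessarily the one the thesis found best. The abstract does not name the two smoothing methods it compares, so Jelinek-Mercer smoothing is used here only as a common example; the stop-word list, the function names, and the sample data are all assumptions.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"the", "of", "that", "a", "an", "in", "by", "is", "are", "to"}

def relation_log_likelihood(relation, entity_text, collection_text, lam=0.5):
    """log P(relation | entity) under a unigram language model with
    Jelinek-Mercer smoothing, after dropping stop words from the relation."""
    terms = [t for t in relation.lower().split() if t not in STOP_WORDS]
    ent = Counter(entity_text.lower().split())
    col = Counter(collection_text.lower().split())
    ent_len, col_len = sum(ent.values()), sum(col.values())
    score = 0.0
    for t in terms:
        p_ent = ent[t] / ent_len if ent_len else 0.0
        p_col = col[t] / col_len if col_len else 0.0
        p = (1 - lam) * p_ent + lam * p_col
        # Floor the probability so unseen terms do not produce log(0).
        score += math.log(p if p > 0 else 1e-12)
    return score

def rank_entities(candidates, relation, collection_text):
    """candidates: list of (name, p_entity, p_type, context_text) tuples,
    where p_entity and p_type are positive probabilities estimated elsewhere."""
    scored = []
    for name, p_entity, p_type, text in candidates:
        s = (math.log(p_entity) + math.log(p_type)
             + relation_log_likelihood(relation, text, collection_text))
        scored.append((s, name))
    return [name for _, name in sorted(scored, reverse=True)]

# Made-up data for illustration only.
contexts = ["lufthansa airlines fly boeing 747 aircraft", "acme bakery sells bread"]
candidates = [("Acme Bakery", 0.5, 0.5, contexts[1]),
              ("Lufthansa", 0.5, 0.5, contexts[0])]
print(rank_entities(candidates, "airlines that fly the Boeing 747", " ".join(contexts)))
```

Dropping stop words shortens the relation to its content terms, so fewer term probabilities have to be computed per candidate, which is one way the reported time saving could arise.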
In the related entity discovery task, an entity is defined as being represented by its unique homepage, so after all candidate entities are ranked, each entity's homepage must also be found. For this homepage finding problem, a method is proposed that combines the multi-attribute representation score of a Web page with the external-link score of the entity's Wikipedia page.
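A minimal sketch of such a score combination is given below. The abstract does not specify the page attributes, the weights, or how the two scores are merged, so the fields (`title`, `url`, `body`), the linear interpolation with `alpha`, the binary link score, and all names and values here are assumptions chosen for illustration.

```python
def homepage_score(field_scores, field_weights, in_wiki_external_links, alpha=0.7):
    """Interpolate a weighted multi-field page score with a Wikipedia
    external-link score (1.0 if the URL is listed there, else 0.0)."""
    page_score = sum(w * field_scores.get(f, 0.0) for f, w in field_weights.items())
    link_score = 1.0 if in_wiki_external_links else 0.0
    return alpha * page_score + (1 - alpha) * link_score

def find_homepage(pages, field_weights, wiki_links, alpha=0.7):
    """pages maps URL -> per-field match scores; returns the best-scoring URL."""
    return max(
        pages,
        key=lambda url: homepage_score(pages[url], field_weights,
                                       url in wiki_links, alpha),
    )

# Hypothetical candidate pages for one entity.
pages = {
    "https://example.com/": {"title": 0.9, "url": 0.8, "body": 0.4},
    "https://example.org/news/item": {"title": 0.6, "url": 0.1, "body": 0.7},
}
weights = {"title": 0.4, "url": 0.4, "body": 0.2}
wiki_links = {"https://example.com/"}
print(find_homepage(pages, weights, wiki_links))
# → https://example.com/
```

The same weighted-sum pattern applies to the Markov random field ranker described earlier, where the learned weights combine document, type, and name relevance.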
Experimental results show that the proposed methods complete the related entity discovery task effectively, greatly reduce the manual effort users spend gathering related entity information, and provide users with useful results.

[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: TP391.3


Article ID: 1833667


Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1833667.html
