Research on Entity Column Discovery Methods for Web Tables
Topics: web tables + entity columns. Source: Beijing Jiaotong University, master's thesis, 2017
[Abstract]: The Internet contains a large number of highly valuable web tables, but machines cannot understand them; only by annotating the semantic information of a table can this structured data be put to better use. The entity column of a web table expresses the table's semantics to a certain extent, so detecting it accurately can greatly improve a machine's understanding of table semantics. Existing knowledge-base-based entity column discovery methods rely solely on matching table headers against knowledge base information. They fail on tables whose headers are semantically ambiguous or absent from the knowledge base, they cannot identify the specific entity-attribute relationships in tables with multiple entity columns, and their accuracy and execution time are unsatisfactory. This thesis proposes an entity column discovery method based on dependencies between attributes. The main contributions are as follows:
(1) An entity column discovery method based on dependencies between attributes is proposed. The method does not rely on a knowledge base or on header information, which improves the efficiency of entity column discovery and broadens the applicability of the algorithm.
(2) An approximate functional dependency detection method adapted to the characteristics of web tables is proposed. It takes noise in the table into account, so that it expresses the functional dependencies between web table attributes more accurately.
(3) The concept of entity-attribute dependency strength is proposed, and the semantic strength of an entity column is defined from it. Judging the semantic strength of an entity column by the dependency strength between entity and attributes improves the accuracy of detecting the strongest entity column.
(4) The concept of entity-attribute dependency strength is introduced into the attribute-dependency-based algorithm. This makes it possible not only to discover entity columns in order of semantic strength, but also to label specific relationships according to the dependency strength of the entity attributes.
Extensive experimental results show that the proposed approximate functional dependency detection method has a clear noise-reduction effect. The proposed attribute-dependency-based entity column discovery methods perform well in both effectiveness and time efficiency, and are more widely applicable.
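The abstract does not give the thesis's actual algorithm or formulas, so the following is only a minimal sketch of the general idea under stated assumptions. It measures an approximate functional dependency with the standard "fraction of rows that must be removed" error (often called g3), and it scores each column by summing how strongly it determines the other columns. The function names, the `threshold` parameter, the scoring rule, and the example table are all hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def fd_error(rows, lhs, rhs):
    """Approximate-FD error for column lhs -> column rhs.

    Returns the fraction of rows that would have to be removed for the
    dependency to hold exactly (0.0 means an exact dependency).
    This is the common g3-style measure, assumed here for illustration.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Within each lhs value, keep only the rows with the most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1.0 - kept / len(rows)


def dependency_strengths(rows, threshold=0.1):
    """Score each column by how well it determines the other columns.

    A dependency counts only if its error is at most `threshold`
    (a hypothetical noise tolerance); its contribution is 1 - error.
    """
    n_cols = len(rows[0])
    strengths = {}
    for lhs in range(n_cols):
        total = 0.0
        for rhs in range(n_cols):
            if lhs == rhs:
                continue
            err = fd_error(rows, lhs, rhs)
            if err <= threshold:
                total += 1.0 - err
        strengths[lhs] = total
    return strengths


def strongest_entity_column(rows, threshold=0.1):
    """Return the index of the column with the highest dependency strength."""
    strengths = dependency_strengths(rows, threshold)
    return max(strengths, key=strengths.get)


# Hypothetical example table: (city, country, continent).
example_rows = [
    ("Beijing", "China", "Asia"),
    ("Shanghai", "China", "Asia"),
    ("Paris", "France", "Europe"),
    ("Lyon", "France", "Europe"),
    ("Tokyo", "Japan", "Asia"),
]

if __name__ == "__main__":
    print(dependency_strengths(example_rows))
    print(strongest_entity_column(example_rows))
```

In this sketch the city column determines both other columns and so receives the highest score. A column of unique values (for example a row number) would trivially determine every other column under this scoring, so a practical method would need an additional filter; how the thesis handles such cases is not stated in the abstract.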
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.0
Article ID: 1748685
Article link: http://www.sikaile.net/guanlilunwen/ydhl/1748685.html