Research on Cleaning and Integration Techniques for Biomedical Linked Data

Published: 2018-12-08 10:59
[Abstract]: The rapid development of Semantic Web technology in recent years has made it easier to integrate and present massive amounts of data. Because the biomedical domain has large data volumes and many subfields, the need to clean and integrate the RDF datasets published by different institutions has become increasingly prominent. Much previous work has applied Semantic Web standards and technologies to build linked data networks over massive biomedical data. Several problems remain, however. Biomedical datasets published with Semantic Web technology usually provide cross-references to other datasets, but these references often contain errors or fail to fully express the link relationships between datasets. Integrated data has to be retrieved through SPARQL queries, which hinders its use by people outside the Semantic Web field, such as biomedical professionals. Because datasets use different ontologies, the results of cross-dataset queries are also hard to integrate. This thesis analyzes the linked data of biomedical datasets and studies data cleaning and data integration techniques to address these problems. Data cleaning analyzes and validates the data and corrects duplicate, erroneous, and missing data. Semantic Web data integration involves techniques such as ontology matching and entity linking: ontology matching unifies the classes and properties of the ontologies used by different datasets, and entity linking connects data in different datasets that refer to the same entity. The main contributions of this thesis are as follows:

1. Building on the Bio2RDF project, the thesis surveys and analyzes mainstream biomedical linked data. It constructs three kinds of link graphs (dataset links, entity links, and term links) and analyzes the correlations between them. The analysis finds that dataset links exhibit the small-world phenomenon, that the degree distribution of entity links does not strictly follow a power law, and that terms overlap considerably across datasets. By studying entity link properties, the thesis also builds a benchmark test set for evaluating entity linking methods. The link analysis method can be applied generally to the analysis of biomedical datasets.

2. Selected datasets are cleaned using string detection, machine learning, and other methods to address errors introduced by automatic conversion and manual input: missing data is completed, erroneous data is corrected, and duplicate data is eliminated. In addition, based on the symmetry and transitivity of entity links, missing links between datasets are completed and erroneous links are corrected, which improves both data quality and link quality.

3. The cleaned datasets are integrated into BioSearch, an ontology-based federated search engine over datasets, which uses ontology matching to support federated queries across datasets. The system gives users a simple and efficient interface for querying and retrieving data. Experimental results show that federated queries and the semantic query interface defined in this thesis are more efficient than two existing linked data search engines. The faceted filtering and entity browsing features implemented in BioSearch were also shown to improve the user experience.
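The abstract does not say how the symmetry and transitivity of entity links are used to find missing links, so the following is only a minimal sketch under stated assumptions. It treats every link as an undirected "same entity" statement, groups identifiers into equivalence classes with a union-find structure, and reports the pairs that are implied but not stated. The function name `complete_links` and the identifiers in the usage note are hypothetical and are not taken from the thesis or from Bio2RDF.

```python
from collections import defaultdict
from itertools import combinations


def complete_links(links):
    """Return links implied by symmetry and transitivity but absent from `links`.

    `links` is an iterable of (source, target) pairs, each read as
    "source and target denote the same entity". This reading is an
    assumption of the sketch, not a statement about the thesis's method.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    stated = set()
    for a, b in links:
        union(a, b)
        stated.add((a, b))

    # Group identifiers by the representative of their equivalence class.
    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)

    # Every ordered pair inside a cluster is implied; keep the unstated ones.
    missing = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            for pair in ((a, b), (b, a)):
                if pair not in stated:
                    missing.add(pair)
    return missing
```

For instance, given the stated links `("a:1", "b:1")` and `("b:1", "c:1")`, the function reports the two reverse links plus both directions between `a:1` and `c:1`. A plain closure like this also propagates any erroneous link through its whole cluster, so a practical pipeline would presumably validate links before or after this step, as the thesis's mention of correcting erroneous links suggests.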
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13


Article ID: 2368234



Link to this article: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2368234.html

