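Application and Study of Topic Models in Computing Gene Semantic Similarity
Published: 2018-01-20 13:40
Keywords: Gene Ontology; semantic similarity; LDA; BTM; topic model. Source: East China Normal University, 2017 master's thesis. Document type: degree thesis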
[Abstract]: In recent years, when biologists discover an unknown gene, they commonly compare it with known genes and infer its properties from the similarity between them. Alignment algorithms compare gene sequences or structures to find genes that are functionally similar or related. Studies have shown, however, that genes that are functionally similar or related are not necessarily highly similar in sequence. To address this, the current mainstream approach analyzes and predicts the properties of unknown genes by computing the semantic similarity between the Gene Ontology terms to which the genes are annotated. Such methods rely only on the relationships among terms within the Gene Ontology, so they reflect gene semantic similarity indirectly and ignore the semantic content of the terms themselves. This thesis proposes a gene semantic similarity algorithm based on topic models that mines intrinsic semantic information from the text representing each term, which addresses the shortcomings of traditional methods to some extent. The thesis makes three main contributions:
1. When computing the similarity of a term pair, latent semantic information is mined from the annotated terms themselves. The text representing each term's semantics is converted into a high-dimensional topic vector, so that the similarity between terms becomes the similarity between their topic vectors.
2. Two models, SSGTLDA and SSGTBTM, are proposed. For long term texts obtained through the Google search engine, SSGTLDA models the document-topic and topic-word relationships to produce a high-dimensional topic vector for each term text. For short term texts taken from Gene Ontology definitions, SSGTBTM models the word pairs across the whole term corpus to produce a high-dimensional topic vector for each term text.
3. Both gene semantic similarity methods, SSGTLDA and SSGTBTM, are implemented and evaluated on two kinds of datasets: term pairs and protein pairs. The experimental results show that both proposed algorithms perform well.
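The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes topic vectors have already been inferred by an LDA- or BTM-style model, uses cosine similarity as the vector comparison, and uses a best-match average to lift term-level scores to the gene or protein level. The abstract specifies neither of those two choices. The term identifiers and vector values are hypothetical placeholders.

```python
import math
from itertools import combinations


def cosine(p, q):
    """Cosine similarity between two topic vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def biterms(tokens):
    """Unordered word pairs co-occurring in one short text.

    BTM-style models are trained on such pairs pooled over the whole
    corpus rather than on per-document word counts.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def best_match_average(terms_a, terms_b, topic_vec):
    """Gene-level similarity from term-level similarities.

    Assumed aggregation scheme (not taken from the abstract): each term
    is matched to its most similar term on the other side, and the two
    directional averages are averaged.
    """
    def directed(src, dst):
        total = sum(
            max(cosine(topic_vec[s], topic_vec[d]) for d in dst) for s in src
        )
        return total / len(src)

    return 0.5 * (directed(terms_a, terms_b) + directed(terms_b, terms_a))


# Hypothetical topic vectors; in practice these come from model inference.
topic_vec = {
    "TERM_A": [0.7, 0.2, 0.1],
    "TERM_B": [0.6, 0.3, 0.1],
    "TERM_C": [0.1, 0.1, 0.8],
}

# Term-pair similarity, then similarity of two hypothetical annotation sets.
term_score = cosine(topic_vec["TERM_A"], topic_vec["TERM_B"])
gene_score = best_match_average(["TERM_A", "TERM_C"], ["TERM_B"], topic_vec)
```

Terms whose topic vectors concentrate on the same topics score close to 1, and the result does not depend on where the terms sit in the ontology graph. This is the contrast the abstract draws with structure-based measures.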
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
Article ID: 1448387
Article link: http://www.sikaile.net/kejilunwen/jiyingongcheng/1448387.html