Research on Algorithms for Predicting Regulatory Motifs and Regulons in Prokaryotes
Topics: regulatory motif + regulon prediction; Source: Shandong University, doctoral dissertation, 2014
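[Abstract]: Bioinformatics is a rapidly developing interdisciplinary field that draws on biology, mathematics, and computer science to analyze biological data and study the phenomena of life. Sequence analysis is one of its core components, and the prediction of DNA sequence motifs has long been an important research problem within it. The prediction of transcription factor binding sites in particular is both biologically significant and algorithmically hard. This dissertation studies algorithms for predicting the regulatory motifs and regulons that govern gene expression in prokaryotes.
A gene must be expressed as its corresponding protein to carry out its biological function, and its expression must be regulated in response to different internal and external conditions. In prokaryotes, expression is regulated mainly through interactions between RNA polymerase and regulatory proteins. A regulatory protein recognizes and binds specific sequence segments in genomic DNA and thereby exerts its regulatory effect; these segments are called regulatory protein binding sites. A genome therefore contains not only gene sequences that encode proteins and RNAs but also regulatory sequences that control gene expression. The binding sites of a given regulatory protein generally have the same length and are highly conserved in sequence; such a conserved sequence pattern is called a cis-regulatory motif. In prokaryotes, several consecutive genes on the genome often form an operon and are transcribed together; a single gene can be regarded as a special case of an operon. The set of operons regulated by the same regulatory protein is called a regulon.
In this dissertation, we first briefly introduce the model representations of regulatory motifs and the algorithms for predicting them. Building on existing motif prediction algorithms and on the distribution of regulatory binding sites across whole prokaryotic genomes, we design a method for assessing the biological significance of predicted motifs, which allows the predicted motifs to be filtered accurately. We use the information content and conservation of motifs for motif similarity analysis and clustering, and we use statistical tools such as the hypergeometric distribution to analyze motif co-occurrence across the whole genome. Together these methods make up the motif prediction and analysis toolkit BoBro2.0, which can be downloaded free of charge from http://code.google.com/p/bobro/.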
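The two quantities named above, motif information content and a hypergeometric test for motif co-occurrence, can be illustrated with a minimal Python sketch. This is not BoBro2.0's code: the function names, the uniform background assumption, the absence of pseudocounts, and the example numbers are all hypothetical choices made for illustration, and the abstract does not state the exact formulas the toolkit uses.

```python
import math
from collections import Counter
from math import comb


def information_content(sites, alphabet="ACGT"):
    """Total information content (bits) of aligned, equal-length binding sites.

    Assumes a uniform background over `alphabet` and no pseudocounts
    (both are simplifying assumptions, not taken from the dissertation).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for col in range(length):
        counts = Counter(s[col] for s in sites)
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / len(sites)
            if p > 0:
                entropy -= p * math.log2(p)
        # Per-column information = maximum entropy minus observed entropy.
        total += math.log2(len(alphabet)) - entropy
    return total


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(max(observed, 0), upper + 1)
    )
    return tail / total


# Hypothetical example: of 1000 promoter regions, motif A occurs in 40,
# motif B occurs in 50, and both occur together in 10.
p_value = hypergeom_upper_tail(population=1000, successes=40, draws=50, observed=10)
print(f"co-occurrence p-value: {p_value:.3e}")
print(f"information content: {information_content(['TTGACA', 'TTGACA', 'TTTACA']):.2f} bits")
```

A small p-value here would suggest that the two motifs share promoters more often than expected by chance; a real analysis would also need multiple-testing correction across motif pairs.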
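Combining motif prediction with phylogenetic footprinting, we design a new method for genome-wide regulon prediction. Phylogenetic footprinting lets us discover regulatory motifs in the regulatory regions of homologous genes, but the results often contain a very high proportion of false positives. To overcome this problem, we design a bipartite-graph-based method for comparing motif similarity. It performs an initial filtering of all motifs and produces a score reflecting the co-regulation relationship between operons: if two operons have a high score, they are more likely to belong to the same regulon or regulons. We keep only the motifs that yield high scores and use them to build a motif similarity graph, in which each motif is a vertex and each significant similarity score is an edge, so that the graph as a whole captures the similarity relationships among the predicted motifs. Analyzing the vertex sets that correspond to known regulons, we find that the subgraphs induced by these vertex sets have higher edge density and clustering coefficients than the original graph, and thus reflect the characteristics of prokaryotic regulons. Based on this finding, we design a clustering algorithm and obtain from the graph the operon sets that correspond to real regulons. Compared with two other scores that reflect co-regulation, our score reflects co-regulation relationships more accurately. Moreover, because we predict regulons with motifs as the vertices, our method handles the problem that overlaps between regulons make clustering inaccurate, and so predicts regulons more accurately. Our prediction pipeline relies entirely on genome sequence data and needs little additional biological annotation, which makes it especially valuable for newly sequenced genomes.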
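The comparison between a regulon's induced subgraph and the full motif similarity graph can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph, the motif names `m1` to `m6`, and the "known regulon" vertex set are invented, and the standard definitions of edge density and average local clustering coefficient are assumed, since the abstract does not give the exact definitions used in the dissertation.

```python
from itertools import combinations


def build_graph(edges):
    """Undirected graph as a dict mapping each vertex to its neighbor set."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def induced_subgraph(adj, nodes):
    """Subgraph containing only `nodes` and the edges among them."""
    keep = set(nodes)
    return {v: adj[v] & keep for v in keep}


def edge_density(adj):
    """Number of edges divided by the number of possible vertex pairs."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nb) for nb in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


def average_clustering(adj):
    """Mean local clustering coefficient (vertices of degree < 2 count as 0)."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


# Hypothetical motif similarity graph: vertices are predicted motifs,
# edges are significant similarity scores.
graph = build_graph([
    ("m1", "m2"), ("m1", "m3"), ("m2", "m3"),  # motifs of one assumed regulon
    ("m3", "m4"), ("m4", "m5"), ("m5", "m6"),  # weaker chain of other motifs
])
regulon = induced_subgraph(graph, ["m1", "m2", "m3"])

print("density    whole:", edge_density(graph), "regulon:", edge_density(regulon))
print("clustering whole:", average_clustering(graph), "regulon:", average_clustering(regulon))
```

In this toy graph the regulon's motifs form a triangle, so both measures are higher on the induced subgraph than on the whole graph, which is the kind of signal a clustering step could look for.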
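To make our algorithms and tools convenient for biologists, we developed DOOR2.0, an online database centered on operon data. It contains the operon structures of 2072 completely sequenced prokaryotic genomes, together with gene function annotations and experimentally verified regulatory protein binding sites. Compared with the previous version published in 2009, DOOR2.0 has a series of new features: (i) it contains 250000 transcription unit structures, either experimentally verified or computationally predicted from RNA-seq data, providing a dynamic functional view of operons; (ii) it integrates operon-centric data resources, providing for each genome not only operon structures but also functional and regulatory information such as cis-regulatory factor binding sites, promoters, and terminators; (iii) it offers an efficient web service for predicting operons in user-supplied genomes; (iv) it visualizes user-selected data in an intuitive genome browser; and (v) it includes a Google-like keyword search engine for quickly finding information in the database. The database is updated as new sequencing data are released and is available at http://csbl.bmb.uga.edu/DOOR/; all data and functions are free to users.
Finally, using a range of comparative genomics methods together with our motif analysis tools, we carried out a systematic analysis of 40 species of Clostridium, with particular attention to genes and functions related to biomass degradation. These studies yielded findings of biological interest and also confirmed the practical value of the methods we developed.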
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: Q811.4
Article ID: 1944848
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1944848.html