Research on Protein Remote Homology Detection and DNA-Binding Protein Identification
Published: 2018-03-09 07:29
Topic: protein remote homology detection. Focus: DNA-binding proteins. Source: Harbin Institute of Technology, 2017 master's thesis. Document type: degree thesis.
[Abstract]: Proteins are the material basis of life and carry out most of its activities. In the post-genomic era, advances in protein sequencing have caused protein sequence databases to grow explosively, so identifying proteins computationally is of great importance in biology. This thesis studies both protein structure and protein function. On the structure side, it addresses protein remote homology detection: proteins with the same or similar functions in different species show clear sequence homology, and this homology is used to assign a protein sequence of unknown class to a superfamily. On the function side, it addresses DNA-binding proteins, which play an important role in living organisms, including in gene transcription, recombination, repair, and replication. Both problems are approached by processing the protein primary sequence and applying machine learning methods. The specific contributions are as follows.

Protein remote homology detection is a foundation of protein structure research. This thesis proposes the concept of Pseudo Dimer Composition (PDC) as an improvement on the original pseudo amino acid composition, which carries too little information. First, a frequency profile containing evolutionary information is used to convert the original sequence into a protein sequence that encodes evolutionary information. The PDC feature extraction method then converts the protein primary sequence into a fixed-length vector. Support vector machines are combined with an ensemble learning strategy to predict the superfamily of a protein; the ensemble is linear, with the ROC value of each family used as its weight. The method achieves an AUC of 0.927 and an AUC50 of 0.749, which the thesis reports as better than other methods in the field.

DNA-binding protein identification is an important direction in protein function research. This thesis is the first to apply a frequency profile containing evolutionary information together with pseudo amino acid composition to this problem. The sequence profile and pseudo amino acid composition convert each protein sequence into a fixed-length feature vector, and a support vector machine classifier is built to identify DNA-binding proteins. The ensemble used in this chapter is a heterogeneous ensemble, which expands the samples to obtain more trained models to combine. On the independent test set, the accuracy is 76.56% and the AUC is 0.8392. In addition, analyzing the weights that the support vector machine assigns to different features shows how important the corresponding amino acids are to the identification, which in turn allows their biological characteristics to be analyzed.

To address the limited information captured by pseudo amino acid composition, the thesis proposes a method that combines K-mer amino acid composition with auto-cross covariance. K-mer composition captures information about amino acid distance pairs, and auto-cross covariance captures global physicochemical information about the amino acids. Optimizing the combination of feature parameters further improves the accuracy of DNA-binding protein identification. On the independent test set, the prediction accuracy of this method is 75.16%, which the thesis reports as an improvement over other methods.

Finally, the thesis proposes a selective ensemble method for DNA-binding protein identification based on an affinity propagation clustering strategy. To improve prediction accuracy and to study ensemble methods further, it adopts a feature extraction strategy based on reduced-alphabet distance pairs. The affinity propagation strategy clusters 656 base classifiers and combines them into an ensemble. The method reaches an accuracy of 83.87% on the independent test set, which the thesis reports as a further improvement over other methods.
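The abstract describes two recurring steps: turning a variable-length protein sequence into a fixed-length vector, and combining classifiers linearly with ROC-based weights. The following is a minimal sketch of both ideas under stated assumptions, not the thesis's implementation. Plain dipeptide (2-mer) frequency stands in for the PDC features, each base classifier is weighted by a validation AUC, and the function names `kmer_composition`, `auc`, and `weighted_linear_ensemble` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k=2):
    """Map a protein sequence to a fixed-length vector of k-mer frequencies.

    Assumption: simple k-mer counting, used here as a stand-in for the
    thesis's PDC features, which are not specified in the abstract.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    total = 0
    for i in range(len(sequence) - k + 1):
        j = index.get(sequence[i:i + k])
        if j is not None:  # skip windows containing non-standard residues
            counts[j] += 1.0
            total += 1
    return [c / total for c in counts] if total else counts


def auc(labels, scores):
    """Area under the ROC curve as the fraction of correctly ordered
    (positive, negative) pairs; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_linear_ensemble(score_lists, weights):
    """Combine per-classifier scores as a weighted average.

    Assumption: weights are validation AUC values, one per classifier.
    """
    total = sum(weights)
    n_samples = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists)) / total
        for i in range(n_samples)
    ]
```

In use, each sequence would be passed through `kmer_composition` before training the base classifiers, and the decision scores of those classifiers on new sequences would be passed to `weighted_linear_ensemble` with the validation AUC values as weights. Normalizing by the sum of the weights keeps the combined score on the same scale as the inputs.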
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4; TP311.13
Article ID: 1587548
Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/1587548.html