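Clustering Research on Cancer Subtype Discovery Based on Genomic Data
Published: 2018-01-28 05:52
Keywords: cancer; cancer subtypes; cancer genome; The Cancer Genome Atlas (TCGA); gene regulatory network; data mining; cluster analysis. Source: University of Science and Technology of China, doctoral dissertation, 2016. Document type: degree thesis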
[Abstract]: Defining and discovering cancer subtypes is an important part of personalized cancer treatment: assigning cancer samples to the correct subtypes provides an important reference for choosing the right therapy for each patient. The development and application of genomic technologies make it possible to obtain high-throughput, whole-genome sequencing data from cancer cases, which lets researchers study differences between individual cancers, and the mechanisms of cancer onset, progression and metastasis, at the whole-genome level. However, cancer genome data are large biological data sets that combine multiple omics data types with high-dimensional features. High dimensionality, high noise and small sample sizes are common to such data and pose new challenges for traditional data mining techniques. Because genomic technologies have accumulated large amounts of cancer sample data, using data mining methods to process these data and to explore the possible subtypes of each cancer, together with the corresponding tumor molecular markers, is of great practical significance for cancer research and treatment.
This thesis takes cancer genome data as its object of study. Addressing the high-dimensional, multi-omics nature of the data, it mainly studies methods for processing and fusing cancer genome data in cluster analysis for cancer subtype discovery, and explores new clustering algorithms for such data. Cancer genomics links genes to cancer research through high-throughput sequencing. Gene chip (microarray) technology and second-generation sequencing are currently the main sources of cancer genome data, and the thesis discusses their technical characteristics and details at length. It also gives a fairly comprehensive introduction to The Cancer Genome Atlas (TCGA), the largest cancer genome research project to date.
The thesis constructs an analysis framework for cancer subtype discovery based on genomic data. The framework comprises preprocessing methods, methods for extracting important features, clustering methods, and methods for evaluating clustering results. The preprocessing methods (data filtering, missing-data imputation and data standardization) are described in detail, and four feature selection methods for genomic data are proposed. Clustering is the core of genomic cancer subtype discovery, and the thesis systematically introduces four main computational biology methods for it: consensus clustering, consensus non-negative matrix factorization, integrative clustering of multiple genomic data sets, and similarity network fusion. For evaluating clustering results, it presents survival analysis, the Silhouette method, and statistical significance tests for clusters.
Clustering based on mining multiple genomic data sets is a very effective way to define and discover cancer subtypes, and it has already produced important findings and applications in many cancer studies. New computational biology methods for subtype discovery continue to appear, but the existing methods based on genomic data are all "pure" machine learning methods, and the complexity of the life sciences means that "pure" machine learning cannot fully solve the problem of cancer subtype identification. The thesis therefore introduces gene regulatory network analysis and integrates the gene regulatory network into the multi-genomic fusion clustering process. It proposes a weighted similarity fusion algorithm based on the miRNA-TF-mRNA gene regulatory network, which combines genomic expression data with gene regulatory network information to cluster cancer samples, and it obtains biologically meaningful cancer subtypes.
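The framework described in the abstract (feature selection, standardization, clustering, evaluation) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a variance filter as a stand-in for one feature selection method, plain k-means as the base clusterer inside a resampling-based consensus clustering, and the mean silhouette width as the evaluation score. All function names and the synthetic two-group cohort are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean, pstdev


def top_variance_features(rows, n_keep):
    """Feature selection stand-in: keep the n_keep highest-variance columns."""
    cols = list(zip(*rows))
    ranked = sorted(range(len(cols)), key=lambda j: pstdev(cols[j]), reverse=True)
    keep = sorted(ranked[:n_keep])
    return [[row[j] for j in keep] for row in rows]


def standardize(rows):
    """Standardization: z-score every column; constant columns map to 0."""
    stats = [(mean(c), pstdev(c)) for c in zip(*rows)]
    return [[(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
            for row in rows]


def dist(a, b):
    """Euclidean distance between two samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means (assumed base clusterer); one label per point."""
    centers = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [mean(col) for col in zip(*members)]
    return labels


def consensus_matrix(points, k, n_runs=30, frac=0.8, seed=0):
    """Fraction of resampling runs in which each pair of samples co-clusters."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, la in zip(idx, labels):
            for b, lb in zip(idx, labels):
                sampled[a][b] += 1
                together[a][b] += int(la == lb)
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def cut_consensus(cons, k):
    """Average-linkage agglomeration on the consensus matrix, stopped at k clusters."""
    clusters = [[i] for i in range(len(cons))]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda ij: mean(
            cons[a][b] for a in clusters[ij[0]] for b in clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters.pop(j)
    labels = [0] * len(cons)
    for c, members in enumerate(clusters):
        for m in members:
            labels[m] = c
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width; higher means tighter, better separated clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a, b = mean(own), min(mean(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def toy_cohort(seed=1, per_group=10, n_signal=5, n_noise=15):
    """Synthetic stand-in for an expression matrix with two planted sample groups."""
    rng = random.Random(seed)
    rows = []
    for shift in (0.0, 5.0):
        for _ in range(per_group):
            signal = [rng.gauss(shift, 1.0) for _ in range(n_signal)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            rows.append(signal + noise)
    return rows


rows = standardize(top_variance_features(toy_cohort(), n_keep=5))
labels = cut_consensus(consensus_matrix(rows, k=2), k=2)
print(labels, round(mean_silhouette(rows, labels), 2))
```

On real data the number of clusters would not be known in advance. A common approach is to repeat the procedure for several values of k and compare the consensus matrices and silhouette scores. Survival analysis, the other evaluation named in the abstract, needs clinical follow-up data and is outside the scope of this sketch.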
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: R73-3;TP311.13
[Citation]: Xu Taosheng; Clustering Research on Cancer Subtype Discovery Based on Genomic Data [D]; University of Science and Technology of China; 2016
Article ID: 1469954
Article link: http://www.sikaile.net/shoufeilunwen/xxkjbs/1469954.html