Design and Implementation of GPU-Based Parallel Computing for the Dirichlet Algorithm

Published: 2018-10-05 18:11

[Abstract]: In recent years, the spread of information technology and rapid advances in hardware have created the preconditions for generating and storing big data. Businesses, research institutions, and government departments all hold large volumes of data. How to extract useful information from these large data sets has become a topic of growing concern, and it is against this background that data mining has attracted attention and developed rapidly. Clustering, an important data mining tool, is the process of placing similar objects in the same group and dissimilar objects in different groups, and it is widely used in many fields. This thesis first introduces the basic theory of data mining and cluster analysis, with a focus on Dirichlet mixture model clustering. Then, building on the Apache Mahout machine learning library, it studies the Dirichlet process mixture model algorithm and its concrete implementation. This mixture model is a Bayesian mixture model that uses a Dirichlet process as its prior. Mahout provides both a single-machine implementation and a MapReduce implementation; this thesis mainly studies the latter. The Dirichlet process clustering algorithm is first run on several data sets, and analysis of the results shows that the algorithm's main cost is concentrated in the processing of the map function. The thesis also studies the GPU (graphics processing unit) and proposes an improved scheme that uses GPU parallelism to raise the algorithm's efficiency. It examines the GPU architecture and its advantages, as well as parallel programming with CUDA. Then, on the basis of the Dirichlet process mixture model source code provided by Mahout, it implements an improved scheme that calls a native CUDA program through JNI, where the CUDA program processes the map function in parallel. Finally, the same data are used as input and the results are analyzed. A comparison of the original and improved programs shows that the improved program raises the algorithm's efficiency, and that the gain is more pronounced when the data volume is large. These results provide a useful reference for research on the performance of data mining algorithms.
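To make concrete why the map function is a natural target for GPU parallelism, the sketch below shows one plausible form of the per-point work in a Dirichlet process clustering iteration: each point is scored against every current cluster model, and the scores are normalized. This is not Mahout's code or the thesis's CUDA kernel. The isotropic Gaussian model and the names `gaussian_pdf`, `responsibilities`, and `map_step` are assumptions made for illustration, and the sampling of a cluster assignment from the normalized scores is omitted.

```python
import math

def gaussian_pdf(point, mean, std):
    """Density of `point` under an isotropic Gaussian (assumed cluster model)."""
    d = len(point)
    sq_dist = sum((x - m) ** 2 for x, m in zip(point, mean))
    norm = (2.0 * math.pi * std * std) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * std * std)) / norm

def responsibilities(point, clusters, weights):
    """Normalized weight_k * pdf_k(point) over all clusters k."""
    scores = [w * gaussian_pdf(point, c["mean"], c["std"])
              for c, w in zip(clusters, weights)]
    total = sum(scores)
    if total == 0.0:
        # Every density underflowed: fall back to a uniform distribution.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def map_step(points, clusters, weights):
    """Score every point independently of all other points.

    Because no iteration of this loop reads another iteration's result,
    the loop body could be assigned to one GPU thread per point.
    """
    return [responsibilities(p, clusters, weights) for p in points]

# Hypothetical toy data: two clusters with equal mixture weights.
clusters = [{"mean": (0.0, 0.0), "std": 1.0},
            {"mean": (5.0, 5.0), "std": 1.0}]
weights = [0.5, 0.5]
points = [(0.1, -0.2), (4.8, 5.3), (2.5, 2.5)]
result = map_step(points, clusters, weights)
```

The only shared inputs are the cluster parameters and mixture weights, which are read-only during the step. That independence is what would let a native routine, for example one reached through JNI, process all points at once.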
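[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13; TP338.6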


Article ID: 2254368


Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/2254368.html
