Analysis and Research of Machine Learning Models Based on Spark
Published: 2018-08-26 18:16
[Abstract]: Distributed computing is now mainstream, but distributed applications built on the MapReduce framework perform frequent I/O operations, which limits their efficiency and performance. The RDD-based Spark distributed computing framework can load data into memory, which suits the needs of iterative machine learning models. To address the problems of machine learning models implemented on MapReduce (mainly those inherent to MapReduce itself), this thesis studies machine learning models based on Spark, chiefly KMeans clustering and ALS collaborative filtering, as well as an online machine learning model based on Spark Streaming. The main contents are as follows.
(1) A parallel KMeans clustering model is designed and implemented on the Spark distributed computing framework and trained on MovieLens datasets of different sizes in comparison experiments. The results show that the model is well suited to a distributed cluster environment and parallelizes efficiently. In addition, using the repartition operator to load the data in partitions optimizes the parallel scheme and effectively reduces the training time of the model.
(2) To address the poor real-time responsiveness of the MapReduce framework when processing massive data, an online computing model based on Spark Streaming is designed and implemented for large-scale KMeans clustering analysis. The model divides the whole process into modules such as data ingestion and online training; the modules are connected by data streams to form a job, which is submitted to the Spark distributed cluster to run. Comparative experiments and performance testing show that the online KMeans clustering model offers high throughput and low latency, and that the cluster runs well.
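The partition-wise KMeans training described in (1) can be illustrated with a minimal plain-Python sketch. It is not the thesis's implementation and it does not use Spark: ordinary lists stand in for data partitions, and the function names (`partial_stats`, `merge_stats`, `kmeans_step`, `kmeans`) and the sample points are hypothetical. Each partition computes per-cluster sums and counts independently (the map side), the partial results are merged (the reduce side), and the centroids are then recomputed.

```python
from functools import reduce
from math import dist


def partial_stats(partition, centroids):
    """Map side: per-cluster coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        # Assign the point to its nearest centroid (Euclidean distance).
        nearest = min(range(len(centroids)),
                      key=lambda j: dist(point, centroids[j]))
        counts[nearest] += 1
        for d, value in enumerate(point):
            sums[nearest][d] += value
    return sums, counts


def merge_stats(a, b):
    """Reduce side: add two partitions' sums and counts element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def kmeans_step(partitions, centroids):
    """One iteration: map over partitions, merge, recompute centroids."""
    sums, counts = reduce(
        merge_stats, (partial_stats(p, centroids) for p in partitions))
    return [
        # A cluster that received no points keeps its previous centroid.
        [s / n for s in cluster_sum] if n else list(old)
        for cluster_sum, n, old in zip(sums, counts, centroids)
    ]


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of iterations (no convergence test in this sketch)."""
    for _ in range(iterations):
        centroids = kmeans_step(partitions, centroids)
    return centroids


# Hypothetical data: six 2-D points split across three partitions.
partitions = [
    [(1.0, 1.0), (1.0, 2.0)],
    [(9.0, 9.0), (10.0, 9.0)],
    [(2.0, 1.0), (9.0, 10.0)],
]
final_centroids = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)], iterations=5)
print(final_centroids)
```

Because only the small per-cluster sums and counts cross partition boundaries, the number of partitions changes how the work is divided but not the resulting centroids (up to floating-point rounding). That is the property a repartitioning optimization relies on.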
(3) The ALS (alternating least squares) collaborative filtering algorithm makes recommendations through matrix factorization: it computes over a large volume of user rating data and stores the many feature matrices produced during the computation. Hadoop HA (high availability) addresses the NameNode single point of failure in the HDFS distributed file system. Spark, a newer in-memory distributed big data computing framework, offers excellent computing performance. This thesis builds an HA Hadoop big data platform based on QJM (Quorum Journal Manager) and, on top of the Spark computing framework, studies and implements the parallel execution of the ALS collaborative filtering algorithm. Comparison experiments on the Netflix dataset against an ALS collaborative filtering algorithm based on Hadoop MapReduce show that the Spark-based ALS algorithm has markedly higher parallel computing efficiency and is better suited to processing massive data.
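The matrix-factorization idea behind ALS can be illustrated with a minimal plain-Python sketch. It is a deliberate simplification and not the thesis's Spark implementation: it uses a single latent factor (rank 1), so each regularized least-squares solve reduces to a scalar formula. The names `solve_factor` and `als_rank1`, the regularization value, and the toy ratings are all hypothetical.

```python
def solve_factor(observed, fixed, reg):
    """Closed-form rank-1 regularized least-squares solution for one row.

    observed: list of (index_into_fixed, rating) pairs for this user or item.
    fixed:    the factors currently held constant on the other side.
    """
    numerator = sum(rating * fixed[j] for j, rating in observed)
    denominator = reg + sum(fixed[j] ** 2 for j, _ in observed)
    return numerator / denominator


def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Alternate between solving for user factors and for item factors.

    ratings: dict mapping (user, item) -> observed rating.
    """
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for (user, item), rating in ratings.items():
        by_user[user].append((item, rating))
        by_item[item].append((user, rating))

    user_factors = [1.0] * n_users
    item_factors = [1.0] * n_items
    for _ in range(iterations):
        # With item factors fixed, every user's solve is independent of the
        # others; this independence is what lets ALS run in parallel.
        user_factors = [solve_factor(obs, item_factors, reg) for obs in by_user]
        # Then fix the user factors and solve for every item.
        item_factors = [solve_factor(obs, user_factors, reg) for obs in by_item]
    return user_factors, item_factors


# Hypothetical toy data: two users, three items, rating (1, 2) is unobserved.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0,
}
user_factors, item_factors = als_rank1(ratings, n_users=2, n_items=3)
predicted = user_factors[1] * item_factors[2]
print(predicted)
```

The toy ratings come from an exactly rank-1 table, so the predicted value for the unobserved entry should land close to 6, slightly shrunk by the regularization term. A practical recommender would use a higher rank, which turns each scalar solve into a small linear system per user or item.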
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; TP181
Article ID: 2205752
Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2205752.html