Design and Implementation of a Distributed ETL System for Industrial Big Data

Published: 2018-08-26 10:59
[Abstract]: Since the start of the Industry 4.0 era, the rapid development of the Internet and computer technology, and their deep integration with industrial systems, have driven profound changes in productivity, relations of production, production technology, business models, and innovation models, moving the entire industrial system toward full intelligence. Industrial big data analysis is a key area in which future industry will gain competitive advantage in the global market. With the arrival of the Internet of Things and cyber-physical systems, more data can be collected, analyzed, and used to make better-informed decisions. Throughout industrial big data analysis, the foundation is how historical data is gathered from the various data sources into the analysis system and how real-time data is loaded from the sensors into it. This calls for ETL (Extract-Transform-Load) data processing tools. Traditional ETL tools mostly run in parallel on a single machine, and their processing speed and throughput fall far short of what industrial data analysis requires. Commercial ETL tools perform well, but they are expensive and demand too much of the hardware to be widely adopted. To address this, the thesis designs and implements a low-cost, high-performance distributed ETL system for industrial data processing.

The design of the distributed ETL system is organized into three modules: data extraction, data transformation, and data loading. For the extraction stage, the thesis designs a change data capture scheme based on triggers on split tables, a differential data synchronization scheme based on data verification, and a real-time data extraction scheme based on the Redis Pub/Sub communication model. For the transformation stage, it designs a batch layer and a speed layer according to the data's requirements for processing speed and volume: the batch layer handles historical data with loose real-time requirements and is implemented with Hadoop MapReduce, while the speed layer handles real-time data and is implemented with Spark Streaming. In the loading stage, Sqoop loads structured data and the HDFS client loads unstructured data. Finally, the thesis carries out functional and performance tests on the system. The experimental results show that the ETL system performs well on industrial big data, which is of practical value for the informatization of industrial data.
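The abstract names a differential data synchronization scheme based on data verification but gives no details. As a minimal sketch of the general idea only, and not the thesis's actual method, the following assumes each table snapshot is available as a mapping from primary key to row, and that rows are compared by an MD5 digest of their column values. The names `row_digest` and `diff_tables` and the sample sensor rows are hypothetical.

```python
import hashlib


def row_digest(row):
    """Return a stable MD5 digest of a row's column values."""
    # Sort the column names so that column order does not affect the digest.
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def diff_tables(source, target):
    """Compare two {primary_key: row} snapshots by digest.

    Returns three sorted lists of primary keys: rows to insert into the
    target, rows to update in the target, and rows to delete from it.
    """
    src = {pk: row_digest(row) for pk, row in source.items()}
    tgt = {pk: row_digest(row) for pk, row in target.items()}
    inserts = sorted(pk for pk in src if pk not in tgt)
    updates = sorted(pk for pk in src if pk in tgt and src[pk] != tgt[pk])
    deletes = sorted(pk for pk in tgt if pk not in src)
    return inserts, updates, deletes


# Hypothetical sample data: row 2 changed, row 3 is new, row 4 was removed.
source = {
    1: {"sensor": "t1", "value": 20.5},
    2: {"sensor": "t2", "value": 31.0},
    3: {"sensor": "t3", "value": 18.2},
}
target = {
    1: {"sensor": "t1", "value": 20.5},
    2: {"sensor": "t2", "value": 30.0},
    4: {"sensor": "t4", "value": 25.0},
}

print(diff_tables(source, target))  # ([3], [2], [4])
```

Comparing digests rather than full rows means only the keys and fixed-size hashes need to be exchanged between the two sides; the full rows are then transferred only for the keys reported as inserts or updates.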
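[Degree-granting institution]: University of Chinese Academy of Sciences (Shenyang Institute of Computing Technology, Chinese Academy of Sciences)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13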
