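Research on a MapReduce-Based Distributed Improved Random Forest Model for Classifying Student Employment Data
Published: 2018-04-08 07:27
Topic: machine learning. Focus: data classification model. Source: Systems Engineering - Theory & Practice (《系統工程理論與實踐》), 2017, Issue 05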
[Abstract]: Educational data mining (EDM) is a frontier research area in the development of modern educational informatization, and it is attracting growing attention from educators and data scientists. In the big data era, as the scale of data to be processed grows rapidly, existing data mining models hit a computing bottleneck on a single processing node, and a range of distributed computing frameworks for big data processing have emerged in response. With these frameworks, machine learning models for mining university employment data can meet future large-scale processing demands and support data mining and decision-making in information integration systems that hold very large data sets. Against this background, this study compares the classification performance of existing models on the target data and proposes an improved random forest model. The model introduces weighting coefficients for the input features when computing each feature's information gain and uses the result as the criterion for choosing the optimal split, with the aim of improving classification performance. Simulation tests measure how much the improved model raises classification performance relative to existing models. In addition, to address the single-node performance bottleneck on massive data classification tasks, the study proposes a large-scale student employment data classification and prediction model based on a distributed improved random forest algorithm. Using the MapReduce distributed computing framework, the trained model is serialized and written, then deserialized and loaded, between the local disk and the distributed file system, which extends the improved random forest model to distributed large-scale data classification.
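The abstract describes the split criterion only in words: a feature's information gain is computed with a weighting coefficient for that feature, and the result decides the optimal split. The Python sketch below is one possible reading of that idea, not the paper's formula. It assumes the weight simply multiplies the ordinary information gain, and the function names, feature names (`major`, `intern`), weights, and sample rows are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Ordinary information gain from splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Expected entropy of the child nodes, weighted by their share of the rows.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def weighted_information_gain(rows, labels, feature, weights):
    """ASSUMPTION: the feature weight scales the gain multiplicatively.
    The abstract does not give the exact formula used in the paper."""
    return weights.get(feature, 1.0) * information_gain(rows, labels, feature)


def best_split(rows, labels, candidate_features, weights):
    """Pick the candidate feature with the highest weighted gain."""
    return max(
        candidate_features,
        key=lambda f: weighted_information_gain(rows, labels, f, weights),
    )


# Hypothetical toy data: four graduates described by two categorical features.
rows = [
    {"major": "CS", "intern": "yes"},
    {"major": "CS", "intern": "no"},
    {"major": "EE", "intern": "yes"},
    {"major": "EE", "intern": "yes"},
]
labels = ["employed", "unemployed", "employed", "employed"]

uniform_choice = best_split(rows, labels, ["major", "intern"], {})
weighted_choice = best_split(rows, labels, ["major", "intern"], {"major": 3.0})
```

With uniform weights the split falls on `intern`, which separates the classes perfectly in this toy data. Raising the weight of `major` to 3.0 is enough to make it the preferred split, which illustrates how feature weights can steer tree construction in a random forest.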
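[Author affiliation]: CIMS Center, School of Electronics and Information Engineering, Tongji University
[Funding]: National Natural Science Foundation of China (71690234)
[Classification number]: G647.38; TP311.13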
Article ID: 1720601
Article link: http://www.sikaile.net/jiaoyulunwen/gaodengjiaoyulunwen/1720601.html