User Feature Analysis Based on Spark
Published: 2018-10-29 16:22
[Abstract]: In recent years, the rapid development of the Internet has created a rich and convenient online environment, and people increasingly communicate, trade, and seek entertainment online. Massive volumes of user network data now fill the Internet, more and more people recognize the value hidden in big data, and a worldwide wave of big data research has followed. This interest has drawn many researchers in China and abroad into big data mining and has produced a body of research on analyzing and mining users' online behavior data. A big data computing platform does not require ultra-high-performance servers; it can be built from ordinary PCs, and such a cluster often delivers better computing performance than an ultra-high-performance server. Distributed computing platforms, represented by Spark, are a new technology that has emerged and grown rapidly in recent years, because they use an in-memory computing model and can provide massive storage and very large computing capacity. Handling the analysis and mining of very large data sets with a cloud computing approach can greatly improve computing speed and the effectiveness of user classification. Combining Spark-style distributed computing platforms with classification mining of massive user data sets is therefore a research direction with both scientific value and application potential. This thesis studies user feature analysis based on Spark and an improved TF-IDF algorithm. The main work is as follows:
1. Spark-related technologies and the process of building a Spark cluster are studied. Using the naive Bayes classification algorithm together with Spark's in-memory computing framework, the thesis analyzes which videos users watch and how many times they watch them, builds classification models for user gender and age range, and describes the architecture of the whole analysis system.
2. The basic classification algorithm does not consider feature weights, so it cannot reflect the value of each feature item. To address this, traditional TF-IDF weights are used in further experiments, and the classification results are compared with those of the basic algorithm.
3. The shortcomings of traditional TF-IDF weighting are identified: it considers only the value of the feature item itself and does not reflect the correlation between feature items and classes. To address this, a TFC-IDFC weighting method based on the correlation between feature items and classes is proposed, the process of optimizing the classification model is described in detail, and classification results are obtained experimentally.
4. The improved weighting method is compared with the basic classification algorithm and with traditional TF-IDF weighting. Measured by accuracy and F1 score, the results show that the TFC-IDFC weights, which take the correlation between feature items and classes into account, give the classification model better classification ability.
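As an illustration of the baseline described in item 2, the following is a minimal sketch in plain Python of traditional TF-IDF weighting feeding a multinomial naive Bayes classifier, scored with accuracy and F1. It is not the thesis's implementation: it does not use Spark, it does not implement TFC-IDFC (the abstract does not give that formula), and the smoothed IDF variant, the function names, the class labels, and the toy viewing data are all assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_idf(rows):
    """Learn a smoothed IDF per video from {video: view_count} rows."""
    n = len(rows)
    doc_freq = defaultdict(int)
    for row in rows:
        for video in row:
            doc_freq[video] += 1
    # The +1 smoothing is an assumption; it keeps every weight positive.
    return {v: math.log((1 + n) / (1 + d)) + 1.0 for v, d in doc_freq.items()}


def apply_tf_idf(rows, idf):
    """Turn raw view counts into TF-IDF weights; videos unseen in training are dropped."""
    weighted = []
    for row in rows:
        total = sum(row.values()) or 1  # guard against empty rows
        weighted.append({v: (c / total) * idf[v] for v, c in row.items() if v in idf})
    return weighted


def train_nb(rows, labels, alpha=1.0):
    """Multinomial naive Bayes over weighted features, with Laplace smoothing alpha."""
    vocab = sorted({v for row in rows for v in row})
    class_docs = defaultdict(int)
    class_feat = defaultdict(lambda: defaultdict(float))
    for row, y in zip(rows, labels):
        class_docs[y] += 1
        for v, w in row.items():
            class_feat[y][v] += w
    model = {}
    for y, count in class_docs.items():
        total = sum(class_feat[y].values()) + alpha * len(vocab)
        log_prior = math.log(count / len(rows))
        log_like = {v: math.log((class_feat[y][v] + alpha) / total) for v in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict_nb(model, row):
    """Return the class with the highest log posterior for one weighted row."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(w * log_like[v] for v, w in row.items() if v in log_like)
    return max(model, key=score)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical toy data: each row maps a video category to a view count.
# "group_a" / "group_b" stand in for a gender or age-range label.
train_rows = [
    {"drama": 5, "news": 1},
    {"drama": 3, "cooking": 2},
    {"sports": 4, "news": 2},
    {"sports": 6, "gaming": 1},
]
train_labels = ["group_a", "group_a", "group_b", "group_b"]
test_rows = [{"drama": 2, "news": 1}, {"sports": 3, "gaming": 1}]
test_labels = ["group_a", "group_b"]

idf = fit_idf(train_rows)
model = train_nb(apply_tf_idf(train_rows, idf), train_labels)
predictions = [predict_nb(model, row) for row in apply_tf_idf(test_rows, idf)]
print(predictions)
print(accuracy(test_labels, predictions), f1_score(test_labels, predictions, "group_a"))
```

Passing the raw count rows straight to `train_nb` and `predict_nb`, without `apply_tf_idf`, gives the unweighted baseline from item 1, so the same two metric functions can compare both variants. On a cluster, the same steps would normally be expressed with Spark's own feature and classification APIs rather than hand-written loops.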
[Degree-granting institution]: Tianjin Polytechnic University (天津工业大学)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
Article ID: 2298200
Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2298200.html