Design and Implementation of an Enterprise News Information Classification Subsystem in a Distributed Environment

Published: 2018-08-27 09:03

[Abstract]: With the rapid growth of the Internet, the volume of online news keeps increasing, and news plays an ever larger role in people's cultural and everyday lives. The main problem addressed in this thesis is how to collect and organize large volumes of news data and surface the news that people are actually looking for. Common search engines return too many results, many of them only weakly related to the topic. To address this, the thesis proposes and designs an enterprise-oriented news classification subsystem that provides news collection, information processing, and news display, so that enterprise users can quickly and accurately obtain news related to their industry. First, a web crawler module is designed. The crawler is written using a breadth-first algorithm and efficiently collects and identifies the news that an enterprise is interested in. Second, a text classification module is designed and implemented, in which a distributed Bayesian algorithm classifies the news texts. During classification, text preprocessing, feature selection, and vectorization require a large amount of computation, and model training suffers from long training times and limited database storage capacity. To solve these problems, a Hadoop distributed computing platform is built, the MapReduce parallel computing model is used to process the different stages of text classification in parallel, and a Hive data warehouse is set up to handle the large storage footprint. When a large amount of new data arrives, the traditional Bayesian method has to relearn all previous samples, which is both time-consuming and cumbersome. The thesis therefore draws on conventional incremental learning methods to design and implement an incremental Bayesian algorithm, which does not retrain on the earlier data and only needs to revise the existing data. Finally, a classification subsystem for enterprise news information is designed, covering information collection, text preprocessing, feature extraction, classifier construction, classification performance evaluation, and incremental learning, and several of its modules are tested. The system obtains news with the crawler and classifies it in the Hadoop environment. Tests show that, on large-scale news data, the incremental classifier on Hadoop improves accuracy by about 4% over the traditional Bayesian classifier, with good execution efficiency and scalability. The thesis provides an implementation scheme for online news text classification that can serve as a reference for text classification in other domains.
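The abstract does not give the details of the incremental Bayesian algorithm, so the following is only a minimal sketch of the general idea, assuming a multinomial naive Bayes model with Laplace smoothing. The class name `IncrementalNaiveBayes`, the method names, and the toy documents are hypothetical and are not taken from the thesis. Because the model stores only per-class counts, a new batch of labeled documents can be folded in by adding to those counts, without revisiting earlier samples.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Illustrative multinomial naive Bayes that keeps only counts.

    Assumption: this is a generic count-based formulation, not the
    thesis's actual algorithm.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word occurrences per class
        self.total_words = Counter()             # total word occurrences per class
        self.vocab = set()                       # all words seen so far

    def partial_fit(self, docs, labels):
        """Add a new batch of tokenized documents to the stored counts."""
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.total_words[label] += len(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest log posterior, or None if untrained."""
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.total_words[label] + self.alpha * vocab_size
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    clf = IncrementalNaiveBayes()
    # Initial training batch (toy data).
    clf.partial_fit(
        [["stock", "market", "shares"], ["match", "goal", "team"]],
        ["finance", "sports"],
    )
    # Later batch: only the counts are updated, nothing is retrained.
    clf.partial_fit([["bank", "shares", "profit"]], ["finance"])
    print(clf.predict(["shares", "profit"]))
```

In a MapReduce setting, counts of this kind can in principle be computed per data split and summed, which is one reason a count-based model suits both parallel training and incremental updates. How the thesis actually partitions the work is not stated in the abstract.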
[Degree-granting institution]: Yanbian University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13; TP391.1


Article ID: 2206764


Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2206764.html
