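Research and Implementation of Real-Time Discovery Technology for Network Hot Topics
Keywords: hot topic discovery; web crawler; Chinese word segmentation; maximal clique mining; topic display platform
Source: Beijing University of Posts and Telecommunications, master's thesis, 2014
Document type: degree thesis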
[Abstract]: With the rapid development of the Internet, social platforms, and mobile technology, people spend more and more time online and share their opinions with others there. The content people care about and discuss each day is what this thesis calls hot topics. Hot topics can play an important role in politics, the economy, culture, and other fields, so research on real-time hot topic discovery has high practical value. Starting from this point, the thesis studies the technologies behind real-time hot topic discovery and implements a real-time hot topic discovery system.

The main work of the thesis is as follows. First, a real-time hot topic discovery system is designed and built, in which the details of each hot topic can be viewed. Second, a template-based crawling technique is proposed and implemented and applied to the system's information collection module; it addresses the crawling of online news and Weibo data and collects data efficiently. Third, a word-frequency-based Chinese word segmentation method is proposed and implemented and applied to the text preprocessing module; it reduces ambiguity during segmentation and yields more accurate segmentation results. Fourth, an improved quasi-maximal clique mining method is proposed and applied to the topic extraction module; it addresses the merging of similar topics among maximal cliques and yields more accurate topics.

The system described in the thesis collects news and Weibo data efficiently, carries out text preprocessing and topic extraction, and finally displays the results on a front-end platform. The system has high practical application value.
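The abstract does not spell out the improved quasi-maximal clique method, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes that keywords are linked when they co-occur in enough documents, that maximal cliques are enumerated with the basic Bron-Kerbosch recursion, and that cliques are merged when their Jaccard overlap reaches a threshold. The function names, the `min_cooccur` and `threshold` parameters, and the sample documents are all hypothetical.

```python
from itertools import combinations


def build_graph(docs, min_cooccur=2):
    """Link two keywords when they co-occur in at least min_cooccur documents."""
    counts = {}
    for words in docs:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    graph = {}
    for (a, b), n in counts.items():
        if n >= min_cooccur:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def merge_similar(cliques, threshold=0.5):
    """Greedily merge cliques whose Jaccard overlap reaches threshold.

    A merged group is in general no longer a clique (assumed here to play
    the role of a "quasi" clique); this merge rule is an assumption.
    """
    topics = []
    for clique in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        for i, topic in enumerate(topics):
            if len(clique & topic) / len(clique | topic) >= threshold:
                topics[i] = topic | clique
                break
        else:
            topics.append(set(clique))
    return topics


# Hypothetical, already-segmented documents (one keyword list per document).
docs = [
    ["earthquake", "rescue", "sichuan"],
    ["earthquake", "rescue", "sichuan"],
    ["earthquake", "rescue", "donation"],
    ["earthquake", "donation", "rescue"],
    ["football", "worldcup", "brazil"],
    ["football", "worldcup", "brazil"],
]
topics = merge_similar(maximal_cliques(build_graph(docs)))
print(topics)
```

In this toy input the two earthquake cliques share two of their four distinct keywords, so they are merged into one topic, while the football clique stays separate. A real system would likely weight edges and tune the threshold on held-out data.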
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1