Hadoop-Based Weibo Text Classification and Commercial Keyword Extraction

Published: 2019-02-19 18:49
[Abstract]: With the rapid development of computer and network technology, Weibo (microblogging) has become a major new medium in China. The rapid growth of its user base, together with the level-by-level propagation of information, has pushed the volume of Weibo data to an unprecedented scale. Faced with the massive text data generated by fast-growing Weibo services, accurately and effectively finding and extracting information of high commercial value, and thereby improving advertising precision, has become a major goal of data processing on Weibo platforms. This thesis studies how to effectively discover and extract commercial keywords from massive Weibo text data. To make the extraction more targeted, the Weibo data is first classified by text category, which both reduces the amount of data handled in a single pass and allows data of the same category to be processed more specifically. The keyword weights of the texts in each category are then adjusted using the words' search weights in Internet search engines, which effectively improves the precision of commercial keyword extraction from Weibo text. Because Weibo text data is large in total volume, short per post, and highly informal in content, traditional classification methods and commercial information extraction algorithms have certain limitations when applied to it. Since a single short post contains few effective features and is rather colloquial, this thesis expands each text's feature words with similar words and collocated words to reduce the chance of losing features. Given the large number of posts and their informal content, it proposes a Weibo text classification method based on the category dispersion of feature words and the degree of that dispersion.
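The abstract does not give the formula behind the dispersion-based classifier, so the following is only a minimal sketch of one way such an idea could work, not the thesis's actual method. It assumes that "category dispersion" can be approximated by the normalized entropy of a word's distribution over categories, and that words concentrated in few categories should count more when classifying a post. The function names `train` and `classify` and the scoring rule are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (category, tokens) pairs.

    Returns (counts, dispersion) where counts[word][category] is the number
    of documents of that category containing the word, and dispersion[word]
    is in [0, 1]: 0 = word occurs in one category only, 1 = evenly spread.
    ASSUMPTION: dispersion is modelled as normalized entropy.
    """
    counts = defaultdict(Counter)
    categories = set()
    for category, tokens in docs:
        categories.add(category)
        for word in set(tokens):  # document frequency, not raw frequency
            counts[word][category] += 1

    # Normalizing constant: entropy of a uniform spread over all categories.
    max_entropy = math.log(len(categories)) if len(categories) > 1 else 1.0
    dispersion = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        entropy = -sum(
            (c / total) * math.log(c / total) for c in per_category.values()
        )
        dispersion[word] = entropy / max_entropy
    return counts, dispersion


def classify(tokens, counts, dispersion):
    """Score each category; concentrated (low-dispersion) words weigh more."""
    scores = Counter()
    for word in set(tokens):
        if word not in counts:
            continue  # unseen word carries no evidence
        total = sum(counts[word].values())
        for category, c in counts[word].items():
            scores[category] += (1.0 - dispersion[word]) * c / total
    return scores.most_common(1)[0][0] if scores else None
```

In this sketch a word that appears equally often in every category contributes nothing to any score, which is one plausible reading of why dispersion would help with short, informal posts where many tokens are uninformative.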
Taking into account Weibo-specific factors such as repost counts, comment counts, and the massive data scale, this thesis modifies the traditional TF-IDF algorithm. The improved TF-IDF algorithm is run on the Hadoop cloud computing platform with all posts of a single user as one computing unit, and the resulting weights are then adjusted by the words' search weights in Internet search engines, achieving effective extraction of commercially valuable keywords from massive data. Experiments show that the proposed classification method performs well on Weibo posts, and that, in Weibo data processing scenarios, the commercial keyword extraction algorithm combining the improved TF-IDF weight with the Internet search weight has good applicability and commercial effect. Combining it with the cloud computing platform improves data processing efficiency to some extent, making processing of massive Weibo data sets feasible and effective.
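The abstract names the ingredients of the keyword scoring (per-user documents, repost and comment counts, TF-IDF, a search-engine weight) but not how they are combined. The sketch below shows one possible combination under stated assumptions and should not be read as the thesis's algorithm. The engagement boost `1 + log(1 + reposts + comments)`, the smoothed IDF, the neutral default of 1.0 for words without a search weight, and the names `user_term_weights` and `commercial_scores` are all assumptions. It is written as plain Python; on Hadoop the two passes would presumably be split into map and reduce steps.

```python
import math
from collections import Counter, defaultdict


def user_term_weights(posts):
    """posts: iterable of (user, tokens, reposts, comments).

    All posts of one user are merged into a single "document".
    ASSUMPTION: each occurrence of a word is boosted by the engagement
    of the post it appears in.
    """
    tf = defaultdict(Counter)
    for user, tokens, reposts, comments in posts:
        boost = 1.0 + math.log1p(reposts + comments)
        for word in tokens:
            tf[user][word] += boost
    return tf


def commercial_scores(posts, search_weight):
    """Return {user: {word: score}}.

    search_weight: hypothetical {word: weight} table standing in for the
    search-engine weights mentioned in the abstract.
    """
    tf = user_term_weights(posts)
    n_users = len(tf)

    # Document frequency: number of users whose posts contain the word.
    df = Counter()
    for words in tf.values():
        df.update(words.keys())

    scores = {}
    for user, words in tf.items():
        total = sum(words.values())
        scores[user] = {}
        for word, weight in words.items():
            # Smoothed IDF so that words used by every user keep a nonzero score.
            idf = math.log((1 + n_users) / (1 + df[word])) + 1.0
            scores[user][word] = (weight / total) * idf * search_weight.get(word, 1.0)
    return scores
```

Treating each user as one document sidesteps the sparsity of individual posts, which matches the abstract's choice of "all posts of a single user" as the computing unit.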
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092; TP391.1



