
Research on Data Collection Techniques in Weibo Public Opinion Systems

Published: 2018-04-18 22:10

  Topics: Weibo data + simulated login; Source: Xiangtan University, master's thesis, 2014


[Abstract]: As the Internet has matured and the mobile Internet has grown rapidly, more and more information is published online, and the public has gradually come to accept this way of communicating. Online information reflects public sentiment to some extent, but inflammatory statements can also incite Internet users, so online public opinion is drawing increasing attention. To foster a healthy online environment, the relevant government departments need to predict, detect, and guide online public opinion effectively. Within this field, public opinion on Weibo receives particular attention, because a growing number of incidents are first exposed on Weibo and then spread and are discussed there until they become public opinion events. The fact that governments at all levels, enterprises, and public institutions have opened Weibo accounts shows how important the platform has become.

This thesis analyzes several problems with data collection in Weibo public opinion systems. It proposes collecting web pages through simulated login, supplemented by a priority queue so that the more influential posts on Weibo are collected first. The main contributions are as follows:

(1) Three commonly used collection methods are analyzed: Weibo push, the Weibo API, and web crawlers. The first two struggle to meet a public opinion system's requirements for data scale and timeliness, while the third does not easily collect useful information. The thesis therefore proposes simulating a browser login to Weibo and fetching page data, which makes it easy to obtain the data on any Weibo user's pages and avoids the rate limits that the first two methods impose on data collection.

(2) Because Weibo has a very large number of users, many of them are missed during collection. The thesis addresses this by constructing a Weibo user network. Each user is abstracted as a node, and the relationships between users (follower, following, repost, comment, and so on) are abstracted as edges, with the quantified value of each relationship used as the weight for that relationship on the edge. Adding nodes and edges in this way builds a very large user network through which new Weibo users can be discovered continuously, which helps keep the collected data complete.

(3) To collect Weibo data efficiently, the thesis adopts a priority queue algorithm. Efficient collection here means that, when faced with a large volume of data, the data are collected in tiers: posts from highly influential users first, then posts from less influential users. To support this, the thesis designs a priority calculation model. Drawing on Sina Weibo's definition of an influential user and on practical considerations, it selects five factors: follower count, following count, activity, propagation power, and timestamp. The priority queue is built with influence as the main factor, so that more influential users are crawled more frequently, while a computed time interval ensures that data from inactive users are still collected. In addition, because Weibo pages share a uniform structure, the thesis designs matching denoising and parsing methods that locate the useful information directly through fixed feature values, which makes parsing efficient. A simple analysis of the collected data yields some simple but interesting findings.

Experimental results show that the method is broadly applicable, requires no manual intervention, and retrieves high-quality information quickly.
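The scheduling idea in contributions (2) and (3) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual priority formula, so everything below is an assumption made for illustration: the `UserStats` fields, the weights, the logarithmic scaling, the `aging_per_hour` rate, and the in-memory `neighbours` map that stands in for the user network. Influence dominates the score, and the time since the last crawl adds a small bonus so that low-influence users are eventually revisited.

```python
import heapq
import itertools
import math
from dataclasses import dataclass


@dataclass
class UserStats:
    followers: int       # follower (fan) count
    following: int       # number of accounts this user follows
    activity: float      # hypothetical measure, e.g. posts per day
    spread: float        # hypothetical measure, e.g. mean reposts per post
    last_crawled: float  # timestamp (seconds) of the previous crawl


def priority(s, now, weights=(0.4, 0.1, 0.2, 0.3), aging_per_hour=0.05):
    """Assumed scoring model: weighted log-scaled influence plus a waiting bonus."""
    wf, wg, wa, ws = weights
    influence = (wf * math.log1p(s.followers) + wg * math.log1p(s.following)
                 + wa * math.log1p(s.activity) + ws * math.log1p(s.spread))
    hours_waiting = max(0.0, now - s.last_crawled) / 3600.0
    return influence + aging_per_hour * hours_waiting


def crawl_order(stats, neighbours, seeds, now, limit):
    """Visit users from highest to lowest priority, discovering new users via edges."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal scores
    heap, seen, order = [], set(), []

    def push(uid):
        if uid in stats and uid not in seen:
            seen.add(uid)
            # heapq is a min-heap, so the score is negated to pop the largest first.
            heapq.heappush(heap, (-priority(stats[uid], now), next(tie_breaker), uid))

    for uid in seeds:
        push(uid)
    while heap and len(order) < limit:
        _, _, uid = heapq.heappop(heap)
        order.append(uid)
        for other in neighbours.get(uid, ()):  # follow edges of the user network
            push(other)
    return order


# Usage with made-up data: one seed user leads to three newly discovered users.
now = 100 * 3600.0
stats = {
    "small": UserStats(10, 10, 1, 1, last_crawled=now),
    "big": UserStats(1_000_000, 100, 10, 500, last_crawled=now),
    "stale": UserStats(10, 10, 1, 1, last_crawled=0.0),
    "hidden": UserStats(1000, 50, 2, 5, last_crawled=now),
}
neighbours = {"small": ["big", "stale"], "big": ["hidden"]}
order = crawl_order(stats, neighbours, seeds=["small"], now=now, limit=10)
```

In this toy run, "stale" has the same influence as "small" but has waited 100 hours, so its waiting bonus places it ahead of the moderately influential "hidden". A real crawler would re-insert each user after crawling, with an updated `last_crawled`, rather than visiting it only once.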
[Degree-granting institution]: Xiangtan University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092




Article ID: 1770288


Article link: http://www.sikaile.net/guanlilunwen/ydhl/1770288.html

