Data Mining and User Behavior Analysis Based on Massive Query Logs

Published: 2018-03-22 13:04

Topic: massive logs　Focus: data mining　Source: Beijing University of Posts and Telecommunications, 2013 master's thesis　Type: degree thesis

[Abstract]: With the rapid development of the Internet and search engine technology, the amount of information on the Web keeps growing, and search engines have become the first choice of most users for obtaining online information. Interaction between users and search engines generates massive query logs, and these logs keep growing. Because the logs contain a large amount of user-related information, they have become a key research object for many companies seeking to better understand and attract users. Storing and computing over massive logs with distributed technology makes research on query logs more convenient. Major Internet companies now pay increasing attention to their own query logs, hoping that timely and accurate analysis and mining will uncover the user behavior characteristics hidden in the logs, thereby improving user satisfaction with the search engine and strengthening the company's market competitiveness. This thesis takes massive query logs as its processing object. The main work is as follows:

(1) Research on log preprocessing techniques. The thesis studies data cleaning, user identification, session identification, path completion, and transaction identification, together with the related algorithms. It combines distributed technology with these algorithms to implement a Hadoop-based log preprocessing pipeline that prepares the data for subsequent mining.

(2) Design of a user log mining system. Given the massive volume of logs, traditional data storage and computation methods are hard to apply to search engine user behavior analysis. To address this, the thesis proposes mining massive logs on the MapReduce programming framework. User behavior is modeled from the query terms, clicked URLs, and user-identifying IDs recorded in the logs, and is represented as a feature vector. A formula for computing the similarity between users is given, the feasibility of distributing the K-means algorithm is analyzed, and detailed steps for the distributed implementation are provided. Experiments show that the algorithm clusters users effectively and performs well on massive data.

(3) Analysis of user behavior. The analysis covers log volume, user count, and the relationship between the two; the number, length, character composition, and most common values of user query terms; the total number of clicked URLs, URL depth, and commonly clicked URLs; and the relationship between the order of results returned by the search engine and the order in which users click them. This multi-angle analysis of the logs yields characteristics of user behavior, providing a reference for improving the interaction between search engines and users.
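The abstract describes representing each user as a feature vector built from query terms and clicked URLs, then clustering users with a distributed K-means. The following is a minimal sketch of one K-means iteration written in a map/reduce style. It is not the thesis's implementation: the abstract does not state the similarity formula or the feature layout, so cosine similarity over sparse count vectors is an assumption here, and plain Python functions (`map_assign`, `reduce_centroid`, both hypothetical names) stand in for Hadoop mapper and reducer tasks.

```python
import math
from collections import defaultdict


def build_user_vectors(log_records):
    """Turn (user_id, query, clicked_url) records into sparse count vectors.

    Assumed feature layout: one dimension per query term ("q:" prefix)
    and one per clicked URL ("u:" prefix).
    """
    vectors = defaultdict(lambda: defaultdict(float))
    for user_id, query, url in log_records:
        vectors[user_id]["q:" + query] += 1.0
        vectors[user_id]["u:" + url] += 1.0
    return {user: dict(vec) for user, vec in vectors.items()}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed metric)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def map_assign(vector, centroids):
    """Map step: emit (index of the most similar centroid, vector)."""
    best = max(range(len(centroids)),
               key=lambda i: cosine_similarity(vector, centroids[i]))
    return best, vector


def reduce_centroid(vectors):
    """Reduce step: average all vectors that were assigned to one cluster."""
    total = defaultdict(float)
    for vec in vectors:
        for k, w in vec.items():
            total[k] += w
    return {k: w / len(vectors) for k, w in total.items()}


def kmeans_iteration(user_vectors, centroids):
    """Run one map + shuffle + reduce round; return new centroids and assignments."""
    groups = defaultdict(list)  # plays the role of the shuffle phase
    assignment = {}
    for user_id, vec in user_vectors.items():
        idx, emitted = map_assign(vec, centroids)
        groups[idx].append(emitted)
        assignment[user_id] = idx
    # An empty cluster keeps its previous centroid.
    new_centroids = [
        reduce_centroid(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
    return new_centroids, assignment
```

In a real Hadoop job the current centroids would be shipped to every mapper, and the iteration would be repeated until assignments stop changing. The sketch keeps everything in memory only to show why the assignment step parallelizes cleanly: each user vector is processed independently of the others.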
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.3;TP311.13

