User Web Information Collection and Analysis System Based on an Intelligent Gateway

Published: 2018-05-08 12:50

Topic: Web information collection + keyword extraction; source: Shandong University, master's thesis, 2016

[Abstract]: The arrival of the information age has made the Internet the most important source of information for individuals and households. More and more users reach the Internet through smart terminal devices, and this way of obtaining and exchanging information has become mainstream. The convenient service software that followed has led major Internet companies to recognize that user information, as a strategic asset, has very high economic value. Capturing users' Web information against a background of massive data and analyzing their behavior is therefore important both for advancing academic research and for maintaining and growing an enterprise's customer base.

At present, the data used to analyze user behavior comes mainly from server-side user logs and browser cookies. With the former, the website records the user's actions during a visit and writes a server log in a specific format; with the latter, a script embedded in the website sends user information to a back-end server. Both methods depend on a specific website. Ideally, a user's access data would be available whichever site the user visits, and the router, as the hub of home network connectivity and data distribution, occupies a central position in the home network. Taking advantage of that position, this thesis designs and implements a user Web information collection and analysis system based on an intelligent router, focusing on the limitations of existing collection methods and the one-sidedness of the information they collect.

The system has two parts, a gateway and a back end. The gateway side extracts and transmits the user ID and the URLs browsed. After receiving the data collected on the gateway side, the back-end server extracts the main text and keywords of the corresponding Web pages, gathers page viewing time statistics, crawls sub-links and computes their relevance, and classifies text by topic.

The thesis makes five main contributions:
(1) It analyzes the system's specific environmental requirements and application scenarios and, drawing on the page structure of news sites and online shopping sites, proposes a Web main-text extraction algorithm that combines text density with multiple feature values. The algorithm speeds up main-text extraction while preserving extraction accuracy.
(2) It proposes a TF-IDF keyword extraction algorithm that combines statistical, structural, and linguistic analysis. The algorithm accounts for the effect of features such as word length and word span on keyword extraction, overcoming the weakness of traditional TF-IDF extraction, which relies entirely on word frequency statistics.
(3) It designs a topic-focused crawling strategy for a Web crawler. Based on the proposed keyword extraction algorithm and the VSM (vector space model) text similarity measure, it crawls sub-links across two levels of pages and computes their relevance.
(4) It proposes a naive Bayes classification algorithm weighted by chi-square values. The algorithm places more emphasis on the correlation between categories and features during text classification, improving classification accuracy.
(5) It presents an overall design for the user Web information collection and analysis system, implements the complete system in code, and tests the feasibility of the design in a home LAN built on an OpenWrt smart router.
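The abstract names TF-IDF keyword extraction and VSM similarity but gives no formulas, so the following is only a minimal sketch of the textbook versions of both, not the thesis's algorithm. It assumes pages have already been reduced to token lists (Chinese text would need a word segmenter first), uses plain term frequency times log inverse document frequency, and leaves out the word-length and word-span features the thesis adds. The function names `tf_idf_vectors`, `top_keywords`, and `cosine_similarity` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=3):
    """Return the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine_similarity(a, b):
    """VSM relevance: cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical data: one visited page and two sub-links found on it.
page = ["router", "gateway", "router", "traffic"]
link_a = ["router", "firmware", "gateway"]
link_b = ["recipe", "soup", "salt"]

page_vec, a_vec, b_vec = tf_idf_vectors([page, link_a, link_b])
print(top_keywords(page_vec, k=2))
print(cosine_similarity(page_vec, a_vec), cosine_similarity(page_vec, b_vec))
```

In a crawler along the lines the abstract describes, a relevance score like this could be used to rank or filter sub-links against the page the user actually visited; any cut-off threshold would be a tuning choice that the abstract does not specify.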
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP274

Thesis citation: Peng Shoujun (彭壽鈞); User Web Information Collection and Analysis System Based on an Intelligent Gateway [D]; Shandong University; 2016


Article number: 1861475


Article link: http://www.sikaile.net/jingjilunwen/jiliangjingjilunwen/1861475.html
