Internet Service Reassembly and Content Extraction

Published: 2019-01-15 07:25

[Abstract]: The rapid development of the Internet has driven fast growth in network applications. The Internet offers users a wide variety of network services and keeps adapting to their needs. Massive amounts of data are generated every day, so filtering out harmful information and selecting useful information has significant research value and engineering significance. This thesis is devoted to the research and implementation of service reassembly and content extraction for network applications. The main work has three parts: the design and implementation of network service reassembly; regular-expression-based content extraction and security auditing for forum and community applications; and DOM-tree-based web page content extraction and analysis. The thesis first introduces HTML, the DOM model, and the key technologies involved, including packet capture and packet reassembly. Second, it designs and implements the network service reassembly process: it describes packet reassembly, uses the open-source libnids library to reassemble TCP sessions, and applies decompression and chunked decoding to the HTTP data to recover the web pages. Third, it collects communication data from dozens of popular forums, identifies through analysis several commonly used general-purpose forum systems, extracts forum identification features, proposes the concept of a forum fingerprint, and optimizes the traditional forum auditing method. Finally, combining the characteristics of web pages with the features of the information to be extracted, it proposes a DOM-based web page extraction method: the page is preprocessed, tags are selected as extraction features, and a DOM tree is built so that page content can be extracted quickly. This method was used to complete the software version tracking module of a network office management service system, and the web page feature extraction method and the characteristics of web pages are analyzed.
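As an illustration of the HTTP decoding step mentioned in the abstract, the following is a minimal Python sketch that assumes a reassembled response has already been split into headers and raw body bytes. It is not the thesis's implementation (which builds on libnids); the function names `decode_chunked` and `decode_http_body`, and the convention that header names are lower-cased, are assumptions made for this example.

```python
import gzip
import zlib


def decode_chunked(body: bytes) -> bytes:
    """Undo HTTP/1.1 chunked transfer coding on a complete message body."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" parameters.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailer fields after it are ignored here
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


def decode_http_body(headers: dict, body: bytes) -> bytes:
    """Remove the transfer coding first, then the content coding."""
    # Assumption: the caller has already lower-cased the header names.
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = decode_chunked(body)
    encoding = headers.get("content-encoding", "").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding == "deflate":
        # Assumes the zlib-wrapped form; some servers send raw deflate instead.
        body = zlib.decompress(body)
    return body
```

The order matters: chunking is applied to the already-compressed bytes on the wire, so the chunks must be joined before decompression is attempted.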
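The DOM-based extraction idea (preprocess the page, pick tags as extraction features, build a tree, then read content from the matching nodes) might be sketched as follows with Python's standard `html.parser`. This is a simplified sketch and not the thesis's method: the `Node` and `TreeBuilder` classes, the `build_dom` helper, and the choice to drop `script` and `style` text as preprocessing are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # assumed preprocessing: drop non-content text


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and text strings, in document order

    def text(self):
        """Concatenate all text below this node, separated by single spaces."""
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(p for p in parts if p)

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIPPED_TAGS:
            self.current.children.append(text)


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

With the tree in place, extracting a table of software names and versions reduces to selecting by tag, for example `[[td.text() for td in tr.find_all("td")] for tr in build_dom(page).find_all("tr")]`, where `page` is the decoded HTML string.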
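[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092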


Article No.: 2408982


Link to this article: http://www.sikaile.net/guanlilunwen/ydhl/2408982.html
