Design and Implementation of a Network Content Filtering System
Published: 2018-11-03 17:23
[Abstract]: Campus networks bring convenience to teachers and students, but they also bring harm: large amounts of unhealthy and useless information flood the network and make campus networks in colleges and universities considerably harder to manage and maintain. Network content filtering is an effective countermeasure that automatically filters out specified information in network traffic. This thesis first reviews the state of network filtering research in China and abroad, its open problems, and the common filtering methods. The system implements two key modules: a network packet capture and reassembly module and a network text data processing module. Together they provide the system's two key functions: filtering specified URLs and filtering the body content of web pages, where the body means text content only and excludes multimedia such as images and video. The capture module is built on an analysis of network protocol parsing that covers Ethernet frames, IP packets, TCP segments, and HTTP messages. On that basis, it captures and analyzes packets on Windows with the WinPcap packet capture library. The module implements URL filtering and HTML page reassembly, and supplies text data to the text data processing module. To suit the campus network, the URL filter library can be composed of multiple user-defined rule libraries, with different rule libraries running in different time periods. The text data processing module studies web page text classification. Because web page text is semi-structured, the thesis first studies and implements the extraction of text data from web pages.
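The time-scheduled URL filtering described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the rule libraries, their time windows, the example host names, and the function names `extract_url`, `active_libraries`, and `is_blocked` are all assumptions made for illustration, and the input is assumed to be an already reassembled HTTP/1.x request in origin form (a path in the request line plus a `Host` header).

```python
from datetime import time

# Hypothetical rule libraries: each one has an active time window and a set
# of blocked host names. Names, windows, and hosts are illustrative only.
RULE_LIBRARIES = [
    {"name": "class_hours", "start": time(8, 0), "end": time(18, 0),
     "blocked": {"games.example.com", "video.example.com"}},
    {"name": "always", "start": time(0, 0), "end": time(23, 59, 59),
     "blocked": {"malware.example.net"}},
]


def extract_url(request: bytes):
    """Return 'host/path' from a raw HTTP/1.x request, or None if malformed."""
    head = request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")          # e.g. ["GET", "/play", "HTTP/1.1"]
    if len(parts) != 3:
        return None
    path = parts[1]
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().lower() + path
    return None                          # no Host header found


def active_libraries(now: time):
    """Select the rule libraries whose time window contains `now`."""
    return [lib for lib in RULE_LIBRARIES if lib["start"] <= now <= lib["end"]]


def is_blocked(url: str, now: time) -> bool:
    """Check the host (port stripped) against every currently active library."""
    host = url.split("/", 1)[0].split(":", 1)[0]
    return any(host == blocked or host.endswith("." + blocked)
               for lib in active_libraries(now)
               for blocked in lib["blocked"])
```

In this sketch, a request for `games.example.com` is blocked at 10:00 but allowed at 20:00, because the `class_hours` library is only active during its window, while hosts in the `always` library are blocked at any time.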
The thesis then focuses on text classification, whose two main technical difficulties are text preprocessing and classifier training. Text preprocessing also involves Chinese word segmentation, feature selection, and weight calculation. After a theoretical analysis and comparison of the mainstream text classifiers, the class center vector classifier was chosen to suit the characteristics of the campus network. The classifier was trained on the training set and evaluated by cross-validation, with fairly satisfactory classification results. The thesis closes with a summary of the network content filtering system and an outlook: future work should aim at a more comprehensive system that filters not only text but also multimedia such as pictures, audio, and video.
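As a rough illustration of a class center vector (centroid) classifier, the Python sketch below weights terms with TF-IDF, builds one normalized center vector per class, and assigns a document to the class whose center has the highest cosine similarity. The abstract does not state which weighting formula or similarity measure the thesis uses, so smoothed TF-IDF and cosine similarity are assumptions here, as are all function names. Documents are assumed to be already segmented into tokens, and feature selection is omitted.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def vectorize(tokens, idf):
    """TF-IDF vector for one tokenized document; unseen terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return normalize({t: c * idf[t] for t, c in tf.items()})


def tfidf_vectors(docs):
    """Fit smoothed IDF on the training documents and vectorize them."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [vectorize(doc, idf) for doc in docs], idf


def train_centroids(vectors, labels):
    """Sum the vectors of each class and normalize the sum. The normalized
    sum points in the same direction as the class mean, which is all that
    cosine similarity depends on."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            sums[label][t] += w
    return {label: normalize(dict(s)) for label, s in sums.items()}


def classify(vec, centroids):
    """Return the label whose center vector is most similar to `vec`."""
    def cosine(a, b):  # both inputs are unit length, so a dot product suffices
        return sum(w * b.get(t, 0.0) for t, w in a.items())
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy usage with made-up, pre-segmented documents and labels.
docs = [["exam", "lecture", "course"], ["course", "library", "lecture"],
        ["casino", "bet", "poker"], ["poker", "bet", "jackpot"]]
labels = ["allowed", "allowed", "blocked", "blocked"]
vectors, idf = tfidf_vectors(docs)
centroids = train_centroids(vectors, labels)
print(classify(vectorize(["bet", "poker", "tonight"], idf), centroids))
```

Training reduces to one pass over the labeled vectors and classification to one dot product per class, which keeps the per-page cost low. A cross-validation run like the one mentioned above would repeat `tfidf_vectors` and `train_centroids` on each training fold and call `classify` on the held-out documents.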
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.08
Article ID: 2308442
Article link: http://www.sikaile.net/guanlilunwen/ydhl/2308442.html