Research on Denoising Techniques for Tibetan Web Pages

Published: 2018-12-07 17:24

[Abstract]: With the rapid development of network information technology and the steady improvement of computer applications in Tibetan regions, more and more Tibetan web pages are appearing on the Internet. They give us a better understanding of the cultural life and folk customs of Tibetan communities, promote mutual exchange, and support the development of Tibetan regions. However, the useful information on a Tibetan web page is usually surrounded by a large amount of noise, such as pop-up advertisements, superfluous images, and irrelevant links. This noise seriously reduces the efficiency with which useful information can be obtained from Tibetan web pages, and removing it effectively has become a pressing problem in Tibetan information processing.

This thesis surveys a large number of existing web page denoising techniques and the content types of Tibetan web pages, and studies the characteristics of DOM technology and its main operating specifications. On this basis it proposes a denoising technique for Tibetan web pages that combines the DOM with display attributes. By analyzing the implicit behavior of people as they read and browse web content, the technique derives how page elements group into blocks according to their display attributes, adopts a display-attribute blocking model, and illustrates the model's application on a sample page. A Tibetan web page is parsed into a DOM tree, its content is analyzed using the display attributes and the blocking model, and the page is then denoised through a sequence of steps: partitioning into display blocks, merging and deleting DOM nodes, and simplifying the DOM tree.

The core step of the technique is extracting the display attributes of the DOM tree nodes, so DOM parsing of Tibetan web pages must be implemented first. After an in-depth study of many web page parsing techniques, a DOM parser for Tibetan web pages was developed in Java on the Eclipse platform. It parses a Tibetan HTML page into a tree of DOM nodes, each of which retains the complete tag attributes from the HTML document, so the display attributes of any information block on the page can be extracted on demand. The parser also provides simple browser functionality: it can parse a Tibetan web page directly from an entered URL or from page source that has been downloaded to the local computer. It has strong tag recognition and repair capabilities and is suitable for most Tibetan web pages.

In addition, based on an analysis of the characteristics of Tibetan web page content, the thesis proposes a method that identifies noise blocks from the frequency of the Tibetan syllable delimiter and the hyperlink ratio of the page content, which effectively identifies the noise blocks in most Tibetan web pages. Finally, filtering the DOM nodes of the retained useful blocks completes the denoising of the page. Extensive testing shows that the technique effectively removes most of the noise from Tibetan web pages and has good practical value and application prospects.
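The noise-block test described in the abstract can be illustrated with a minimal sketch. The thesis's own parser was written in Java and its exact formulas and thresholds are not given here, so everything below is an assumption made for illustration: the syllable-delimiter frequency is taken to be the share of non-whitespace text characters that are the Tibetan tsheg (U+0F0B), the hyperlink ratio is taken to be the share of text characters that sit inside `<a>` elements, and the names `block_stats`, `is_noise`, `min_tsheg_ratio`, and `max_link_ratio` as well as the threshold values are hypothetical. Python's standard `html.parser` stands in for the DOM parser.

```python
from html.parser import HTMLParser

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


class _StatsParser(HTMLParser):
    """Accumulates text statistics for one HTML fragment (one block)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_chars = 0   # non-whitespace text characters
        self.link_chars = 0   # of those, characters inside <a> elements
        self.tsheg = 0        # of those, tsheg marks
        self._a_depth = 0
        self._skip_depth = 0  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # script and style bodies are not visible text
        text = "".join(data.split())  # drop all whitespace
        self.text_chars += len(text)
        self.tsheg += text.count(TSHEG)
        if self._a_depth:
            self.link_chars += len(text)


def block_stats(fragment):
    """Return (tsheg_ratio, link_ratio) for an HTML fragment."""
    parser = _StatsParser()
    parser.feed(fragment)
    parser.close()
    if parser.text_chars == 0:
        return 0.0, 0.0
    return (parser.tsheg / parser.text_chars,
            parser.link_chars / parser.text_chars)


def is_noise(fragment, min_tsheg_ratio=0.1, max_link_ratio=0.5):
    """Flag a block as noise if it has little Tibetan text or is mostly links.

    Both thresholds are illustrative guesses, not values from the thesis.
    """
    tsheg_ratio, link_ratio = block_stats(fragment)
    return tsheg_ratio < min_tsheg_ratio or link_ratio > max_link_ratio


if __name__ == "__main__":
    syllable = "\u0f56\u0f7c\u0f51\u0f0b"  # one Tibetan syllable plus tsheg
    article = "<p>" + syllable * 5 + "</p>"
    navigation = '<ul><li><a href="#">Home</a></li></ul>'
    print(is_noise(article), is_noise(navigation))
```

In this sketch a block with no visible text at all is also treated as noise, because its tsheg ratio is zero. A real implementation would apply the test per display block of the DOM tree rather than to hand-picked fragments.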
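[Degree-granting institution]: Northwest Minzu University (西北民族大學)
[Degree level]: Master's
[Year degree conferred]: 2010
[Classification number]: TP393.092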



Article No.: 2367555
