Research on DOM-Based Web Page Purification Methods
Published: 2018-12-15 00:34
[Abstract]: The Internet has become the most important repository of information. Web pages, however, contain large amounts of material unrelated to the content a reader cares about: navigation bars, advertisements, copyright notices, questionnaires, and so on. This irrelevant content seriously degrades the results of Web information mining. Web page purification aims to make cluttered page content clear, structured, and organized, and to remove the irrelevant parts. It has become a key technology for Web information mining.
This thesis introduces the techniques related to web page purification and their role in Web information mining, surveys the currently popular web page segmentation models, and analyzes their strengths and weaknesses. Commercial web pages are now commonly designed in the "DIV plus CSS" style, in which designers deliberately place logically related information in the same DIV tag and control layout with style sheets. Based on this observation, a new web page segmentation model, DSS_DOM, is proposed. The model identifies the basic data units in a page and divides the whole page into logical regions. A purification algorithm built on the DSS_DOM model is then studied. The algorithm analyzes the characteristics of web page noise, derives a set of evaluation criteria, and assigns weights to judge the importance of each logical region, thereby separating topic regions from noise regions and purifying the page.
The open-source project Lucene is used to index the purified page collection, and a search function is implemented on top of the purified pages. Experiments show that the DSS_DOM model and its algorithm reduce the size of the Lucene index and improve Lucene's precision. The DSS_DOM model and its algorithm are also applied to the CPCK Chinese web page classifier to implement automatic classification of Chinese web pages on purified input. The experimental results show that the DSS_DOM model and its algorithm make the topic and category of each page clearer and improve classification accuracy.
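The region-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the thesis's actual evaluation criteria or weights, so everything below is an assumption made for illustration: regions are taken to be the outermost DIV elements, and the two scoring features (text length and link density), the weights `w_text` and `w_link`, the `scale` constant, and the function names `score_region` and `purify` are all hypothetical. It is not the DSS_DOM algorithm itself.

```python
from html.parser import HTMLParser


class DivRegionParser(HTMLParser):
    """Collect text length and link-text length for each outermost <div>."""

    def __init__(self):
        super().__init__()
        self.regions = []     # one dict per outermost <div>
        self._div_depth = 0   # nesting depth of currently open <div> tags
        self._a_depth = 0     # nesting depth of currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._div_depth == 0:
                region_id = dict(attrs).get("id") or ""
                self.regions.append({"id": region_id, "text_len": 0, "link_len": 0})
            self._div_depth += 1
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._div_depth > 0 and text:
            region = self.regions[-1]
            region["text_len"] += len(text)
            if self._a_depth > 0:
                region["link_len"] += len(text)


def score_region(region, w_text=1.0, w_link=2.0, scale=100):
    """Hypothetical score: long text raises it, link-heavy text lowers it."""
    if region["text_len"] == 0:
        return -w_link  # an empty region is treated as noise
    text_score = min(region["text_len"] / scale, 1.0)
    link_density = region["link_len"] / region["text_len"]
    return w_text * text_score - w_link * link_density


def purify(html, threshold=0.0):
    """Label each outermost <div> region as 'topic' or 'noise'."""
    parser = DivRegionParser()
    parser.feed(html)
    parser.close()
    return [
        (r["id"], "topic" if score_region(r) > threshold else "noise")
        for r in parser.regions
    ]
```

Called on a page with a link-only navigation DIV, a text-heavy main DIV, and a short footer DIV, `purify` labels only the main DIV as a topic region. A real system would need criteria tuned on actual pages; the threshold and weights here are placeholders.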
[Degree-granting institution]: China University of Petroleum
[Degree level]: Master's
[Year degree awarded]: 2009
[Classification number]: TP393.092
Article ID: 2379606
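[Citing documents]
Related journal articles (top 1)
1 Zou Yongqiang; Zhong Zhinong; An efficient noise filtering method for news web pages [J]; Microcomputer & Its Applications; 2011, No. 16
Related master's theses (top 7)
1 Wang Lechao; Research on the extraction and matching of literature information in the Web environment [D]; Dalian University of Technology; 2010
2 Zou Yongqiang; Research on techniques for extracting person entity relations from news web pages [D]; National University of Defense Technology; 2011
3 Luo Limin; Design and implementation of a web page purification system based on the DOM model [D]; Hunan University; 2010
4 Bai Yuzhao; Research and implementation of a vertical search engine [D]; Jiangnan University; 2012
5 Fang Jiapei; Research on the main technologies of vertical search engines [D]; Jinan University; 2010
6 Chen Jiajia; Research on Deep Web data integration and its application in the book-purchasing domain [D]; Jinan University; 2010
7 Mo Zhuoying; Web information extraction based on semantic DOM [D]; Guangxi Normal University; 2012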
Article link: http://www.sikaile.net/wenyilunwen/guanggaoshejilunwen/2379606.html