Research on the Detection of Hidden Spam Web Pages
Published: 2018-07-28 07:59
[Abstract]: Web spam refers to pages whose creators use techniques that mislead or deceive search engines, so that the pages rank higher in search results than they deserve. Such pages reduce the accuracy and efficiency of search engine retrieval and seriously degrade the user's search experience, and they are widely recognized as one of the greatest challenges facing Internet retrieval. Among web spam techniques, hidden spam is covert, deceptive and difficult to detect, and it has become a problem in spam page detection that urgently needs solving.
This thesis surveys the current state of research, in China and abroad, on detecting hidden spam pages, and introduces the types and characteristics of hidden spam techniques. It summarizes the observable phenomena of cloaked spam pages and describes in detail how cloaking is implemented and which detection techniques have been proposed for hidden spam pages.
Based on the seven cloaking phenomena it identifies, the thesis proposes a type-based cloaking detection algorithm and designs a framework for a cloaked spam page detection system. The framework consists of four modules: dataset acquisition, web page feature extraction, cloaking detection and file management. For the dataset acquisition module, the thesis describes in detail how search results are fetched by simulating both a search engine crawler and a user's browser. For the feature extraction module, it analyzes in detail the effectiveness of specific tags and of content and link features. The cloaking detection module implements the proposed algorithm and uses a naive Bayes classifier to detect complex cloaking; its experimental results are compared with those of several common classification algorithms. The file management module manages the system's files.
The thesis also builds a Chinese spam vocabulary and a Chinese sample dataset of cloaked spam pages, validates the cloaking detection algorithm experimentally, and analyzes the experimental results in detail.
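The acquisition, feature extraction and classification steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the two copies of a page (one served to a crawler, one served to a browser) have already been fetched, takes the terms that appear only in the crawler copy as features, and classifies them with a small multinomial naive Bayes model. The function names, the class labels "cloaked" and "benign", and the toy training data are all hypothetical.

```python
import math
import re
from collections import Counter


def terms(html):
    """Strip tags and return lowercase word tokens.

    Assumption: whitespace-delimited text; Chinese pages would need a
    word segmenter, which this sketch does not include.
    """
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"\w+", text.lower())


def crawler_only_terms(crawler_copy, browser_copy):
    """Terms served to the crawler but absent from the browser copy."""
    seen_by_user = set(terms(browser_copy))
    return [t for t in terms(crawler_copy) if t not in seen_by_user]


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def log_score(self, tokens, c):
        total = sum(self.word_counts[c].values())
        denom = total + self.alpha * len(self.vocab)
        # Log prior of the class, then one smoothed log likelihood per token.
        score = math.log(self.doc_counts[c] / sum(self.doc_counts.values()))
        for w in tokens:
            if w in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.word_counts[c][w] + self.alpha) / denom)
        return score

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_score(tokens, c))


# Hypothetical toy training set: crawler-only terms and a hand-assigned label.
train_docs = [
    ["casino", "bonus", "free"],
    ["cheap", "loans", "free"],
    ["login", "cookie", "banner"],
    ["mobile", "layout", "banner"],
]
train_labels = ["cloaked", "cloaked", "benign", "benign"]
model = NaiveBayes().fit(train_docs, train_labels)

# Hypothetical page pair: the crawler copy carries extra spam terms.
crawler_copy = "<p>Photo blog</p><div>free casino bonus</div>"
browser_copy = "<p>Photo blog</p>"
diff = crawler_only_terms(crawler_copy, browser_copy)
print(diff, model.predict(diff))
```

Using only the crawler-side difference as input keeps legitimately dynamic pages (where both copies share most content) from being scored on their ordinary text. A real system would also need the tag and link features the abstract mentions.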
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP393.092
Article ID: 2149431
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2149431.html