Research and Implementation of Heuristic-Based Phishing Website Detection Technology
Published: 2019-01-13 20:22
[Abstract]: Phishing websites are a form of network attack in which web pages carry deceptive content that lures Internet users into submitting personal information, allowing attackers to steal private data and even personal assets. To improve the accuracy of phishing website detection and reduce dependence on third-party tools and resources, this thesis studies heuristic phishing website detection and phishing page topic identification. First, the thesis studies key techniques for web content preprocessing. For web data collection and storage, it proposes an update-based storage strategy that periodically collects information on phishing websites published by third-party platforms. For text feature acquisition, it uses m-TextRank, a keyword extraction algorithm tailored to web page text, to extract and store textual features. Second, to improve the precision and stability of phishing detection, the thesis optimizes the detection scheme by identifying new features promptly and selecting the best feature subset precisely, and it proposes a multi-layer heuristic phishing website detection model consisting of a feature extraction layer, a feature selection layer, and a heuristic classification layer. The model uses five feature selection algorithms to preprocess the feature set, and the thesis examines the performance of three decision-tree-based classification algorithms. Experimental results show that feature selection by information gain combined with the random tree classification algorithm achieves 96% accuracy and 95% recall at low time cost. Third, to study the correlation between page topic and page legitimacy, as well as the topic distribution of phishing websites, the thesis proposes a phishing page topic identification algorithm based on LDA-SVM. The algorithm builds an LDA-SVM topic classification model through text preprocessing, Gibbs sampling, LDA modeling, SVM classification, and performance evaluation, and uses that model to identify page topics. Experiments show that topic identification accuracy for phishing websites reaches 93%. The thesis then applies this topic classification model to websites that have passed through heuristic detection, providing corroborating evidence for the heuristic detection results. Finally, building on this research, the thesis designs and implements a heuristic phishing website detection system. The system provides web page information collection, legitimacy detection, and page topic identification. System tests show that it meets the need to check the legitimacy of unknown websites and, overall, meets the expected goals.
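The feature selection step named in the abstract (ranking features by information gain before classification) can be illustrated with a minimal sketch. This is not the thesis's implementation: the feature names (`has_ip_in_url`, `has_login_form`, `uses_https`), the toy rows, and the labels are hypothetical, and the random tree classifier that the thesis pairs with this step is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
names = ["has_ip_in_url", "has_login_form", "uses_https"]
rows = [
    (1, 1, 0),
    (1, 0, 1),
    (0, 1, 0),
    (0, 0, 1),
]
labels = [1, 1, 0, 0]

ranked = rank_features(rows, labels, names)
for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
```

In this toy set only the first feature separates the two classes, so it ranks first; a selection layer of the kind the abstract describes would keep the top-ranked features and pass them on to the classifier.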
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.08
Article ID: 2408362
Article link: http://www.sikaile.net/guanlilunwen/ydhl/2408362.html