Research and Implementation of a Book-Oriented Vertical Search Engine

Published: 2018-07-05 05:04

Topics: vertical search engine + Shark-Search; Source: Beijing University of Technology, 2014 master's thesis

[Abstract]: The Internet has become an important repository of information resources, and users rely on the retrieval services of search engines to find the information they want. Traditional general-purpose search engines meet users' basic search needs, but because they cover such a broad range, the results they return contain a large amount of information the user does not care about. Users must filter the results further, and this extra step degrades the search experience. Vertical search engines address this shortcoming: they narrow the retrieval scope to a single domain or topic on the web, which ensures at the data source that what users retrieve is what they care about. They also process messy web information, extracting the main parts and presenting them in structured form so that users can quickly find the most important information.

The thesis first introduces the basic concepts and classification of search engines, then their working principles. By comparing how general-purpose and vertical search engines work, it introduces and analyzes the key technologies of vertical search, such as topical (focused) web crawlers and topic-similarity judgment.

The main work of the thesis is as follows:
(1) Hyperlinks on the same topic tend to have similar URL structures. Based on this property, the traditional page-content-based Shark-Search topical crawling algorithm is improved so that the structural features of a URL influence the predicted priority score of each child link.
(2) The computation of page similarity with the vector space model is analyzed, and a two-stage (secondary) topic judgment method is proposed to obtain more high-quality topic-related pages.
(3) Based on how book metadata is distributed within web pages, a semi-automatic metadata extraction algorithm is designed using the parsing tool HTMLParser.
(4) A prototype of a vertical search engine for book resources is implemented with the full-text indexing toolkit Lucene, and the default ordering of Lucene search results is extended with custom sorting.

Finally, the topical crawling algorithm is analyzed experimentally. It performs well on well-structured sites where topical pages are relatively concentrated, because URLs on the same topic are more obviously similar on such sites. Validation of the book-oriented prototype shows that it returns more precise results than a general-purpose search engine, and that the custom extension of Lucene's default sorting yields a more reasonable ranking of results.
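The abstract does not give the thesis's scoring formula, so the following is only a minimal sketch of the general idea of letting URL structure influence a child link's crawl priority. Everything in it is an assumption made for illustration: the function names, the use of shared leading path segments as the "URL structure" measure, the plain term-frequency cosine as a much-simplified stand-in for Shark-Search's content-based score, and the weight `url_weight`.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def url_structure_similarity(url, relevant_url):
    """Fraction of leading path segments shared with a known relevant URL.

    Hypothetical measure: returns 0.0 when the hosts differ.
    """
    u, r = urlparse(url), urlparse(relevant_url)
    if u.netloc != r.netloc:
        return 0.0
    a = [s for s in u.path.split("/") if s]
    b = [s for s in r.path.split("/") if s]
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    longest = max(len(a), len(b))
    return shared / longest if longest else 1.0


def child_priority(anchor_text, child_url, topic, relevant_urls, url_weight=0.3):
    """Blend a content score with a URL-structure score (weight is assumed)."""
    content = cosine_similarity(term_vector(anchor_text), term_vector(topic))
    structure = max(
        (url_structure_similarity(child_url, r) for r in relevant_urls),
        default=0.0,
    )
    return (1 - url_weight) * content + url_weight * structure
```

With this sketch, two child links with identical anchor text are ranked differently when only one of them shares a path prefix with a page already judged relevant, which is the kind of effect the abstract describes for sites where topical pages are concentrated under similar URLs.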
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.3

Article ID: 2099033

Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2099033.html
