Research and Implementation of a Medical Vertical Search Engine Based on an Improved PageRank Algorithm
[Abstract]: In recent years, the Internet has become an important platform for obtaining medical and health information, and search engines make finding that information far more convenient. However, existing medical search engines still have shortcomings in topic similarity judgment and web page ranking. This thesis therefore builds a vertical search engine for the medical domain by improving both the topic similarity judgment and the PageRank algorithm. The main research contents and results are as follows:
(1) Seed URLs are selected, a subject thesaurus for the medical domain is constructed, and the vector space model is studied. After a page is crawled, its topic relevance is judged from its hyperlinks, its meta-information, and the thesaurus, and pages unrelated to the topic are removed, which greatly improves the efficiency of the search engine.
(2) The PageRank and HITS algorithms are studied and analyzed. Because PageRank is more efficient and the volume of data to be processed is large, PageRank is chosen as the page ranking algorithm. To address its shortcomings, namely bias toward old pages, even distribution of weight among linked pages, and topic drift, a time feedback factor is introduced to raise the scores of "new" pages, an authority feedback factor is introduced to adjust page weights, and a topic relevance factor is introduced to suppress topic drift.
(3) Based on these two results, a vertical search engine for the medical domain is designed, consisting mainly of a crawler module and a retrieval service module. In addition, taking advantage of Nutch's extensibility and plug-in mechanism, the IKAnalyzer Chinese word segmenter is added to improve the engine's handling of Chinese text.
(4) Finally, the system is deployed and verified. Experiments show that the search engine segments Chinese text at the word level with a segmentation accuracy of 90%; that crawler efficiency improves by 8 percent once pages are filtered by topic similarity; and that, with the improved PageRank algorithm, retrieval accuracy improves noticeably, with the precision of the top 10 results returned to users exceeding 0.7.
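The abstract names the three correction factors but gives no formulas, so the following is only a minimal sketch of how such factors could enter a PageRank iteration. Everything specific here is an assumption made for illustration: the function names `cosine_similarity` and `weighted_pagerank`, the exponential freshness decay with a `half_life_days` parameter, the product used to combine the factors, and the toy pages. Topic relevance is approximated with vector space cosine similarity against a small thesaurus vector, and each page passes its rank to its out-links in proportion to the targets' combined weight instead of evenly.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dict-like)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_pagerank(links, topic, age_days, authority,
                      damping=0.85, half_life_days=365.0, iterations=50):
    """PageRank in which out-link shares are proportional to target weights.

    ASSUMPTION (not from the thesis): the weight of a target page is
    topic * authority * freshness, where freshness halves every
    `half_life_days`. `links` maps every page to its list of out-links.
    """
    pages = sorted(links)
    n = len(pages)
    weight = {
        p: topic[p] * authority[p] * 0.5 ** (age_days[p] / half_life_days)
        for p in pages
    }
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links[src]
            total = sum(weight[t] for t in targets)
            if total == 0.0:
                # No out-links, or every target has zero weight:
                # spread this page's rank uniformly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
                continue
            for t in targets:
                new_rank[t] += damping * rank[src] * weight[t] / total
        rank = new_rank
    return rank


# Toy data, entirely hypothetical.
thesaurus = Counter({"diabetes": 1, "insulin": 1, "therapy": 1})
page_terms = {
    "hub": Counter("diabetes insulin therapy overview".split()),
    "x": Counter("insulin therapy for diabetes".split()),
    "y": Counter("cheap flights and hotels".split()),
}
links = {"hub": ["x", "y"], "x": ["hub"], "y": ["hub"]}
topic = {p: cosine_similarity(terms, thesaurus) for p, terms in page_terms.items()}
age_days = {"hub": 30.0, "x": 30.0, "y": 30.0}
authority = {"hub": 1.0, "x": 1.0, "y": 1.0}

for page, score in sorted(weighted_pagerank(links, topic, age_days, authority).items()):
    print(page, round(score, 4))
```

In this toy graph the pages `x` and `y` are structurally identical, so plain PageRank would score them equally; with the weighting above, the off-topic page `y` receives none of the rank that `hub` passes on. Because each page's outgoing shares are renormalized, the scores still sum to 1.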
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3