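Design and Implementation of a Clustering Search System for Open Source Software
Topic: open source software + clustering search; Source: National University of Defense Technology, master's thesis, 2012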
[Abstract]: Using open source software to improve the efficiency and quality of software development has become an important trend in software engineering. With the rapid development and wide adoption of open source software, a large number of open source communities for developing and sharing such software have emerged on the Internet. Today, open source software of great variety and volume is spread across many of these communities, which poses a serious challenge for searching and selecting it. How to automatically collect and retrieve the massive open source data held in Internet open source communities, cluster the retrieved results, and thereby offer users a cross-community clustering search service for open source software is a topic of considerable research and practical value.

This thesis analyzes search engine and clustering search technologies in depth. Based on how open source software data is distributed on the Internet and on the characteristics of that data, it designs Influx, an open source software search system that covers open source community data crawling, attribute extraction and indexing, and clustering analysis of search results, and that effectively supports cross-community clustering search of open source software. The main work is as follows.

First, the thesis compares search engine and clustering search technologies and, in response to the particular requirements of an open source community search system, proposes Influx, a clustering search system architecture for open source software. The architecture divides such a system into four layers (data storage, data retrieval, data analysis, and data access) and is readily extensible.

Second, the thesis designs the information retrieval and clustering analysis mechanisms of the system. Efficient mechanisms for crawling open source software information, extracting information, and indexing attributes are built on the Heritrix and Lucene platforms, and an improved search result clustering mechanism is designed on the basis of the K-means algorithm so that users can browse search results selectively.

Finally, the thesis implements the Influx search system and runs experiments to verify its functionality and performance. The results show that Influx effectively supports Internet-wide, cross-community search for open source software and clustering analysis of the search results.
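The abstract names K-means as the basis of the result clustering mechanism but does not say how Influx improves on it. As a rough illustration only, the sketch below shows the unmodified baseline: search result descriptions are turned into TF-IDF vectors and grouped with plain K-means using cosine similarity. The function names, the whitespace tokenizer, the farthest-first seeding, and the sample result strings are all assumptions made for this example; none of them come from the thesis.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document (a string) into a unit-length sparse TF-IDF dict."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (equals cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(members):
    """Sum sparse vectors and rescale the result to unit length."""
    total = Counter()
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Plain K-means on unit vectors; seeds are picked deterministically (assumption)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to its closest existing seed.
        idx = min(range(len(vectors)),
                  key=lambda i: max(cosine(vectors[i], c) for c in centroids))
        centroids.append(vectors[idx])
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its most similar centroid.
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical search result descriptions, invented for this example.
results = [
    "lucene java full text search library",
    "solr search server built on lucene",
    "elasticsearch distributed search engine lucene",
    "heritrix web crawler archive",
    "scrapy python web crawler framework",
    "nutch web crawler",
]
vecs = tfidf_vectors(results)
labels = kmeans(vecs, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here the search-library results and the crawler results land in separate clusters, which is the kind of grouping that would let a user browse one category of results at a time. A production system would also need to choose k and label each cluster, neither of which this sketch attempts.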
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP311.52