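Research on Similarity Calculation and Clustering Algorithms for Science and Technology Projects
Topics: VSM + semantic understanding; Source: Hangzhou Dianzi University, master's thesis, 2015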
[Abstract]: As national funding for science and technology grows, research institutions submit more and more project applications, and preventing duplicate project approvals has become an important part of modern science and technology project management. Manual duplicate checking is no longer feasible, and existing duplicate-checking systems fall short in both accuracy and speed, so the key technologies behind such systems need further study. This thesis focuses on the representation model, similarity calculation, and clustering of science and technology projects. The main work is as follows:
1. Because project content is complex and information-rich, a project knowledge representation model and a project relationship model are proposed that combine the matter-element knowledge representation model with the vector space model (VSM), to simplify the later representation and processing of projects.
2. To meet the duplicate-checking requirement, similarity calculation methods based on the vector space model and on semantic understanding are analyzed and summarized, and on that basis a VSM similarity calculation method based on semantic understanding is proposed. Because project names carry a lot of useful information in few words and contain many technical terms, an improved sentence similarity calculation method based on edit distance is also proposed. The two methods are applied to the main content and to the name of a project respectively, and their results are combined with adjusted weights to give the similarity of the whole project.
3. Duplicate checking normally compares the project under review with every existing project, which is inefficient, so this thesis clusters the projects first and then checks for duplicates. Existing clustering algorithms either require parameters to be supplied in advance or have a time complexity too high for large project libraries. This thesis therefore proposes a dual-threshold nearest-neighbor project clustering algorithm and applies it to the duplicate-checking system, improving checking speed without affecting checking accuracy.
These similarity calculation and clustering methods were applied in the Zhejiang Province science and technology project similarity detection system, where they implemented duplicate checking effectively with good accuracy and running speed, which verifies the feasibility of the research results.
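The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the formulas, so the sketch substitutes a plain term-frequency cosine for the semantic VSM method and the classic Levenshtein distance for the improved edit-distance method. The function names, the `name_weight` value, the `low`/`high` thresholds, and the core/border rule for the two thresholds are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of two term-frequency vectors (plain VSM, no semantic weighting)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical names."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def project_similarity(p, q, name_weight=0.4):
    """Weighted sum of name and content similarity (the weight is an assumption)."""
    return (name_weight * name_similarity(p["name"], q["name"])
            + (1 - name_weight) * cosine_similarity(p["content"], q["content"]))


def cluster_projects(projects, low=0.3, high=0.6):
    """Single-pass nearest-neighbour clustering with two thresholds.

    Assumed rule: a project joins the core of its nearest cluster when the
    similarity is at least `high`, is attached as a border member when it is
    between `low` and `high`, and starts a new cluster otherwise.
    """
    clusters = []
    for p in projects:
        best, best_sim = None, -1.0
        for c in clusters:
            sim = max(project_similarity(p, m) for m in c["core"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= high:
            best["core"].append(p)
        elif best is not None and best_sim >= low:
            best["border"].append(p)
        else:
            clusters.append({"core": [p], "border": []})
    return clusters
```

In this sketch a project under review would be compared only with the members of its nearest cluster rather than with the whole library, which is the speed-up the abstract refers to. No cluster count has to be supplied in advance, although the two thresholds still need to be chosen.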
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year conferred]: 2015
[Classification number]: TP391.1