

Further Research on the Co-occurrence Latent Semantic Vector Space Model

Published: 2018-01-26 05:58

Keywords: vector space model; CLSVSM; TCLSVSM; co-occurrence analysis; clustering. Source: Journal of Intelligence (《情报杂志》), 2017, No. 12. Paper type: journal article


[Abstract]: [Purpose/Significance] Representing documents as vectors is the first task in document clustering. The co-occurrence latent semantic vector space model (CLSVSM) adds semantic information to the vector space model (VSM) by using co-occurrence analysis to mine the maximum latent semantic information between pairs of feature terms, and it clearly improves clustering performance on Chinese documents compared with VSM. The model still needs further study, however. Its applicability to clustering English documents has yet to be tested. Can statistics other than the max statistic be used to build the model, and how well would the resulting models cluster? With large document collections, the model's dimensionality is often high and computation is costly, so the model needs to be optimized. [Method/Process] First, CLSVSM is applied to topic clustering of an English document set (data from Web of Science, abbreviated WOS), and the results are compared with those of VSM. Next, three other common statistics (min, ave, med) are used to build corresponding CLSVSM models, and the models built from all four statistics are compared on clustering Chinese and English documents. More importantly, we propose the truncated co-occurrence latent semantic vector space model (TCLSVSM) and test its clustering performance. [Results/Conclusions] The experiments show that CLSVSM is equally applicable to clustering English documents; that among the models built from the four statistics, CLSVSM-max gives the best clustering on both Chinese and English documents; and that TCLSVSM maintains clustering performance while significantly reducing computation cost.
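The abstract does not give the model's formulas, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' method. It assumes a Jaccard-style co-occurrence strength between terms, binary weights for terms that occur in a document, and, for each absent term, a weight obtained by applying the chosen statistic (max, min, ave, med) to its co-occurrence strengths with the terms that do occur. The `threshold` parameter is one possible reading of "truncated": weights below it are set to zero to keep vectors sparse. All function and parameter names are hypothetical.

```python
from statistics import mean, median

# The four statistics named in the abstract, mapped to Python callables.
STATS = {"max": max, "min": min, "ave": mean, "med": median}


def cooccurrence_strength(docs, vocab):
    """Jaccard-style strength for every ordered term pair (an assumed measure)."""
    # For each term, the set of document indices in which it occurs.
    postings = {t: {i for i, d in enumerate(docs) if t in d} for t in vocab}
    strength = {}
    for a in vocab:
        for b in vocab:
            union = postings[a] | postings[b]
            shared = postings[a] & postings[b]
            strength[a, b] = len(shared) / len(union) if union else 0.0
    return strength


def clsvsm_vector(doc, vocab, strength, stat="max", threshold=0.0):
    """Build one document vector; absent terms get a co-occurrence-based weight."""
    present = [t for t in vocab if t in doc]
    aggregate = STATS[stat]
    vector = []
    for term in vocab:
        if term in doc:
            vector.append(1.0)  # assumption: binary weight for occurring terms
        elif present:
            w = aggregate([strength[term, p] for p in present])
            # Assumed truncation rule: drop weights below the threshold.
            vector.append(w if w >= threshold else 0.0)
        else:
            vector.append(0.0)
    return vector


# Toy corpus: each document is a set of feature terms.
docs = [
    {"vector", "space", "model"},
    {"vector", "clustering"},
    {"semantic", "clustering"},
]
vocab = sorted(set().union(*docs))
strength = cooccurrence_strength(docs, vocab)

for stat in STATS:
    print(stat, clsvsm_vector(docs[0], vocab, strength, stat=stat))
print("truncated", clsvsm_vector(docs[0], vocab, strength, threshold=0.5))
```

In this toy corpus, "clustering" never appears in the first document but co-occurs with "vector" elsewhere, so the max variant gives it a nonzero weight while the min and med variants do not. The resulting vectors could then be passed to any standard clustering routine.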
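[Author affiliations]: School of Mathematical Sciences, Shanxi University; Institute of Management and Decision, Shanxi University
[Funding]: A research result of the National Natural Science Foundation of China project "Construction and Application of the Co-occurrence Latent Semantic Vector Space Model and Its Semantic Kernel" (No. 71503151) and the Shanxi Province Higher Education Innovative Talent Support Program project "Deep Topic Clustering of Text Information Based on Latent Semantics" (No. 2016052006)
[Classification numbers]: G353.1; TP391.1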
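[Text snapshot]: 0 Introduction. The big data era has made information resources unprecedentedly abundant, and the vast majority of them are text. How to process this information effectively is a key problem in text mining, information retrieval, and related fields. Text resources differ from ordinary data resources in two ways: first, text data is semi-structured or unstructured; second, text data carries a large amount of semantic information. Traditional data mining algorithms cannot...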



