Design and Implementation of a Python-Based Uighur Text Clustering System
[Abstract]: With the rapid growth of the Internet, the volume of online data keeps increasing, and acquiring, managing, and using this data quickly and effectively has become an important topic in data mining. Text clustering, an effective tool for organizing and managing text, has therefore attracted growing research attention. It addresses these problems to a certain extent, saving time and improving efficiency, and it has important applications in information retrieval, search engines, digital library management, and related fields. This thesis first builds a large-scale text corpus based on the characteristics of the Uighur language. To reduce the dimension of the feature space, a preliminary stop-word list is constructed from the accumulated text collection, and word stemming is applied. The experimental results show that the method can reduce the dimension of the original feature space by 23% and 25%. Second, the advantages and disadvantages of the K-means and GAAC (group-average agglomerative clustering) algorithms are studied in depth. An improved K-means algorithm is proposed to overcome both the instability of classical K-means, which depends heavily on the initial cluster centers, and the high time complexity of GAAC. The experimental results show that the improved K-means algorithm is feasible and effective. Finally, a Python-based Uighur text clustering system is implemented using these algorithms. The system consists of three main modules: a preprocessing module, a text representation module, and a clustering algorithm module. Comparative experiments with the developed system verify the accuracy, stability, and low time complexity of the improved K-means algorithm, and the clustering results show that the system performs stably.
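The three modules named in the abstract (preprocessing, text representation, clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the stop-word list and sample documents are English placeholders, stemming is omitted because the abstract does not describe the Uighur stemmer, and the deterministic farthest-first seeding is a stand-in for the improved K-means, whose details the abstract does not give. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical placeholder stop-word list; the thesis builds its own for Uighur.
STOP_WORDS = {"the", "a", "of", "and"}


def preprocess(text, stop_words=STOP_WORDS):
    """Preprocessing module: lowercase, split on whitespace, drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


def normalise(vec):
    """Scale a sparse vector (dict) to unit L2 length; empty if the norm is zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs_tokens):
    """Text representation module: unit-length TF-IDF vectors with smoothed IDF."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(t for toks in docs_tokens for t in set(toks))
    vectors = []
    for toks in docs_tokens:
        counts = Counter(toks)
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def farthest_first_centers(vectors, k):
    """Deterministic seeding (an assumption, not the thesis's method):
    start from document 0, then repeatedly add the document that is
    least similar to its most similar existing center."""
    centers = [vectors[0]]
    while len(centers) < k:
        idx = min(range(len(vectors)),
                  key=lambda i: max(cosine(vectors[i], c) for c in centers))
        centers.append(vectors[idx])
    return centers


def kmeans(vectors, k, max_iter=20):
    """Clustering module: K-means on cosine similarity with fixed seeding."""
    centers = farthest_first_centers(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: cosine(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            total = {}
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old center if a cluster is empty
                centers[j] = normalise(total)
    return labels


docs = [
    "the price of cotton and grain",
    "cotton price and grain market",
    "the football match and the team",
    "team wins football match",
]
labels = kmeans(tfidf_vectors([preprocess(d) for d in docs]), k=2)
print(labels)  # [0, 0, 1, 1]
```

Because the seeding is fixed rather than random, repeated runs on the same input give the same clusters, which illustrates the kind of instability in classical K-means that the abstract says the improved algorithm targets.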
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1
Article ID: 2458444
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2458444.html